The latest version of this document can be found on the Marioni Lab GitHub page. This repository also contains instructions to obtain the raw data required to reproduce the results shown below.
library(ggplot2)
library(reshape2)
library(parallel)
library(Matrix)
library(pheatmap)
library(RColorBrewer)
library(reshape2)
library(scales)
library(ggsignif)
library(viridis)
library(limSolve)
library(BiocStyle)
library(DropletUtils)
library(here)
library(knitr)
library(cowplot)
folder_location = here()
ncores= 16
With the rapid increase in throughput of next-generation sequencing technologies, an individual run of a typical sequencing machine (such as those produced by Illumina) generates many more reads than is necessary for interrogating single libraries generated by most functional genomic assays. To make efficient use of these machines, DNA libraries are typically pooled together prior to sequencing, in a process known as “multiplexing”. Briefly, unique barcodes are ligated onto the ends of the DNA molecules within each library before pooling. This incorporates a known sequence into each read, allowing the assignment of reads to their libraries of origin after sequencing. Multiplexing also ensures that technical effects are consistent across samples, avoiding batch effects between sequencing lanes or flow cells; and can provide robustness against the failure of sequencing lanes, which would otherwise result in the loss of entire samples. As such, multiplexing is widely considered to be standard practice for many sequencing experiments, and is essential for cost-effective analysis of small libraries such as those in single-cell RNA-sequencing studies.
The most recent DNA sequencing machines released by Illumina (HiSeq 3000/4000/X, X-Ten, and NovaSeq) use patterned flow cells to improve throughput and cost efficiency. On these new flow cells, the process of “seeding” DNA molecules into the patterned wells and amplification of the seeded DNA occur simultaneously. These machines have been in use for several years in a diverse range of genomic fields. However, it has been recently reported that the use of these machines can lead to the mislabelling of DNA molecules with the incorrect library barcode (Sinha et al. 2017). The mislabelling is likely driven by the extension of free barcode molecules using other DNA molecules as a template (Figure 1). The phenomenon has been acknowledged by Illumina (Illumina 2017), although estimates of swapping fractions vary between reports (Costello et al. 2017). It is unclear whether a permanent solution to the problem will be forthcoming as rapid amplification after seeding is critical to the operation of the patterned flow cell machines (Sinha et al. 2017).
Figure 1: Schematic of the mechanism of barcode swapping on the HiSeq 4000
(As proposed by Sinha et al (2017).)
This “swapping” of barcode labels (also called “hopping” or “switching”) is problematic for analyses of sequencing data. Reads labelled with a barcode specific to a given sample may have originated from any other multiplexed sample in the same pool, compromising the interpretation of the sample labels and their use in downstream analyses. This phenomenon is particularly relevant for single-cell -omics assays, where a large number of samples (i.e., cells) are necessarily multiplexed together for efficient use of sequencing resources. Our manuscript (“Detection and removal of barcode swapping in single-cell RNA-seq data”) quantifies the effect of barcode swapping in a variety of single-cell RNA-seq datasets. We also show that swapping can create artefactual cell libraries in droplet scRNA-seq experiments. Finally, we have implemented a strategy to eliminate the effects of barcode swapping in droplet-based datasets without unnecessarily excluding data.
This Supplementary Materials file steps through all the analyses that were performed in the manuscript. All code chunks are accessible by clicking on the Code buttons below.
Before continuing, we define a few terms:
Swapping occurs between a donor library, from which the transcript originated, and recipient libraries, in which the swapped read is detected after sequencing.
The swapping fraction is defined as the fraction of reads that have been mislabelled, from the set of all cDNA-derived reads in a pool of multiplexed libraries sequenced on a single flow cell lane.
libs <- read.table(paste0(folder_location, "/data/richard_4000.tab"), header = TRUE, stringsAsFactors = FALSE)
libs_2500 <- read.table(paste0(folder_location, "/data/richard_2500.tab"), header = TRUE, stringsAsFactors = FALSE)
# There's an empty well on one of the plates too, but it contains ERCCs. It is therefore an expected combination
libs$expected[libs$empty_well] <- TRUE
libs_2500$expected[libs_2500$empty_well] <- TRUE
We consider two 96-well plates of single-cell RNA-seq libraries for mouse T-cells. We used dual indices for cell labelling, i.e., a different barcode was used at each end of the molecule. The barcodes used for each plate are from mutually exclusive sets - any barcode from one plate was never used on the other (Figure 2). For sequencing, all cell libraries from the two plates were multiplexed.
ggplot(data = libs, mapping = aes(x = bc1, y = bc2)) +
geom_tile(aes(fill = expected | empty_well), col = "grey50") +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Barcode 1", y="Barcode 2") +
scale_fill_manual(values = c("FALSE" = "grey30", "TRUE" = "coral"), name = "", labels = c("TRUE" = "Plate of single cells", "FALSE" = "Unused barcode combination"))
Figure 2: Overview of the experimental design in the Richard dataset
Each position of the plot represents a barcode combination. Each of the orange blocks represents at 96-well plate. One barcode combination (N729,S522) in one of the plates did not contain a cell, but did contain barcodes and spike-in transcripts. Barcode combinations in the grey positions were not used, and thus should not contain sequencing reads.
Cells were prepared for single-cell RNA-seq by the SmartSeq2 protocol (Picelli et al. 2014) with the following modifications. Cells were sorted into 96-well plates holding 4 \(\mu\)L of lysis buffer composed of 2.3 U SUPERase In RNase inhibitor (Thermo Fisher Scientific), 0.11 % (v/v) Triton X-100 (Sigma), 12.5 mM DTT (Thermo Fisher Scientific), and 2.5 mM dNTP mix (Thermo Fisher Scientific). 1 \(\mu\)L annealing mix containing diluted ERCC RNA Spike-In Mix (Thermo Fisher Scientific) and 10 \(\mu\)M oligo-dT30VN (Sigma) was added to each well before reverse transcription with SuperScript II (Invitrogen). cDNA amplification was performed with 23 PCR cycles and the resulting PCR products purified with Ampure XP Beads (Agencourt) at a volume ratio of 0.7:1 beads:DNA. Libraries were prepared using Nextera XT DNA Sample Preparation Kit and indexes from the Nextera XT Index Kit v2 Set A and Nextera XT Index Kit v2 Set D (Illumina). Each plate of libraries was pooled, cleaned with Ampure XP beads, and quantitated using KAPA Library Quantification Kit (Roche). Equimolar quantities of libraries from each plate were combined for sequencing.
Reads were demultiplexed allowing for any of the barcode combinations shown in Figure 2 (including the impossible combinations). Read mapping was performed using the Subread aligner (v1.5.1) (Liao, Smyth, and Shi 2013) to the Ensembl mm10 genome with ERCC92 annotation. We used a Phred offset of 33 and only considered uniquely mapped reads, with default values for all other parameters. We counted the number of reads mapped to each gene in each cell using the featureCounts function (Liao, Smyth, and Shi 2014) in the Rsubread package with default options (except for minMQS=10, to retain only high-quality alignments). This assigned reads to exonic regions of each gene in the Ensembl mm10 GTF file, or to spike-in transcripts in the Thermo-Fisher ERCC92 GTF (see https://github.com/MarioniLab/CommonResources for details).
In the absence of barcode swapping, it should be impossible to observe mapped reads with barcodes from each of the two different plates, i.e., there should be no mapped reads in the grey areas of Figure 2. We will refer to these barcode combinations as “impossible” combinations, in comparison to the expected combinations in orange. For the expected combinations where cells have been loaded, there are many mapped reads (Figure 3) as expected. In the impossible barcode combinations, we observe a lower but non-zero number of mapped reads, consistent with the presence of barcode swapping.
ggplot(data = libs, mapping = aes(x = bc1, y = bc2)) +
geom_tile(aes(fill = log10(mapped))) +
theme(axis.text.x = element_text(angle = 90)) +
scale_fill_viridis(name = "log10\nmapped\ncounts") +
labs(x="Barcode 1", y="Barcode 2")
Figure 3: Number of mapped reads per barcode combination, coloured on a log10 scale
medfrac.mapped <- median(libs$mapped[!libs$expected])/median(libs$mapped[libs$expected])
totfrac.mapped <- sum(libs$mapped[!libs$expected])/sum(libs$mapped[libs$expected])
medfrac.all <- median(libs$reads[!libs$expected])/median(libs$reads[libs$expected])
totfrac.all <- sum(libs$reads[!libs$expected])/sum(libs$reads[libs$expected])
The distribution of total number of mapped reads (i.e., library sizes) for all combinations are shown in Figure 4. The impossible combinations have a median mapped-read library size that is 1.5% of the median size of the expected combinations. The total number of mapped reads assigned to impossible combinations is 1.1% of that assigned to expected combinations. Note that the empty well is still considered as an expected barcode combination due to the presence of ERCC spike-in transcripts.
pdf <- melt(libs)
ggplot(data = pdf, mapping = aes(x = expected, y = value, fill = variable)) +
geom_boxplot() +
scale_y_log10(breaks = 10^(seq(3,7)), labels = c("1,000", "10,000", "100,000", "1,000,000", "10,000,000")) +
labs(x = "Real cell", y = "Library size") +
theme_bw() +
scale_x_discrete(labels = c("\"Impossible\" barcode\ncombination", "Expected barcode\ncombination"), name = "") +
scale_fill_manual(name = "Read type", values = c("royalblue2","springgreen3"), labels = c("All reads", "Mappable"))
Figure 4: Boxplots of the total number of reads for the impossible and expected barcode combinations
Dots represent barcode combinations that have totals more than 1.5 interquartile ranges from the edge of the box.
We focused on mapped reads as these are most relevant to downstream analyses. Nonetheless, we observe similar results for all reads, consistent with the ability of barcode swapping to affect all molecules on the flow cell. The median-to-median ratio of all reads (impossible:expected) is 1.8%, while the total number of reads assigned to impossible combinations is 1.6% of the total number of reads assigned to expected combinations.
We emphasize that our results are highly robust to contamination from ambient RNA or human/bacterial sources. Such contamination, which is possible on empty wells of a microwell plate, may lead to the appearance of an elevated rate of swapping, because mappable reads are detected where no cell was placed. Regardless of the amount of contamination, the impossible barcode combinations should not contain any reads, because these pairs of barcodes were never mixed in library preparation. Barcode swapping is the only possible mechanism for obtaining a substantial number of non-zero reads for these combinations. (We ignore the possibility of barcode sequencing errors causing misassignment of reads, which would be extremely unlikely for 8 bp barcodes that are well separated in base space.) This provides a point of difference for our experimental design compared to that of Sinha et al. (2017).
Denote each barcode combination as \((i, j)\) where barcode \(i \in {1,\dots,16}\) represents a row in Figure 2 (with \(i=1\Rightarrow \text{S522}, i=16\Rightarrow \text{S502}\)) and barcode \(j \in {1,\dots,24}\) represents a column (\(j=1\Rightarrow \text{N701}, j=24\Rightarrow \text{N729}\)). Let \(M_{i,j}\) denote the number of seeded cDNA molecules that truly originate from this combination, and let \(X_{i,j}\) denote the number of mapped reads. \(M_{i,j}\) therefore represents the true source of reads, while \(X_{i,j}\) represented the reported source, after swapping. Impossible barcode combinations are those with \(1 \leq i \leq 8, 1\leq j \leq 12\) or \(9\leq i \leq 16, 13\leq j \leq 24\), and have \(M_{i,j}=0\) by definition.
We assume that barcode swapping is rare, so it is unlikely that one molecule will undergo more than one round of swapping. This means that reads will only be transferred between combinations that already share a single barcode. This is illustrated in Figure 5. For that example, the combination N719 and S505 (red position) would receive swapping contributions from the cells in the blue positions.
is_target <- libs$bc1 == "N719" & libs$bc2 == "S505"
avail <- (libs$bc1 == "N719" | libs$bc2 == "S505") & libs$expected
schem_cols = rep("black", nrow(libs))
schem_cols[libs$expected] = "grey40"
schem_cols[is_target] = "indianred"
schem_cols[avail] = "cornflowerblue"
col_index = unique(schem_cols); names(col_index) = col_index
ggplot(data.frame(bc1 = libs$bc1, bc2 = libs$bc2, cols = schem_cols), aes(x = bc1, y = bc2)) +
geom_tile(aes(fill = as.character(cols))) +
scale_fill_manual(values = col_index) +
theme(legend.position = "none") +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Barcode 1", y = "Barcode 2")
Figure 5: A schematic of the expected barcode combinations that are potential donor libraries (blue) for swapping into a recipient library (red), an impossible combination (S505/N719)
Only a single barcode needs to be swapped for transcripts in the blue combinations to appear as reads in the red combintion. Used barcode pairs are shown in grey, and unused barcode pairs are shown in black.
Let \(\tau\) be the conversion rate of seeded cDNA molecules to mapped reads in the same cell library. This is probably less than 1 due to the presence of unmappable sequences (e.g., transcribed repeats). Moreover, a seeded PCR duplicate of a cDNA molecule (formed on the flow cell) would normally count as a new read for the gene in the original cell. However, if the duplicate molecule has swapped its barcode, the cell has effectively “lost” this additional read. This will further decrease the value of \(\tau\).
We further assume that the number of observed swapped reads is proportional to the number of molecules that are available for swapping. Define \(\rho\) as the rate of swapping from any single donor library to any single recipient library, i.e., the proportion of molecules in the donor library that appear as mislabelled reads in the recipient. For each barcode combination, the number of reads can be modelled as
\[\tilde{X}_{i,j} = \tau M_{i, j} + \rho \left( \sum_{k \neq j} M_{i,k} + \sum_{l \neq i} M_{l,j}\right) \;. \]
As barcode swapping is rare, \(\rho\) should be very low such that \(\tilde{X}_{i,j} \approx \tau M_{i,j}\) for the expected barcode combinations where \(M_{i,j}>0\). We further approximate \(\tilde{X}_{i,j}\) for these expected combinations by replacing it with the observed \(X_{i,j}\). This means that, for each impossible combination \((i^*, j^*)\), we have
\[\tilde{X}_{i^*,j^*} \approx \frac{\rho}{\tau} \left( \sum_{k \neq j^*} X_{i,k} + \sum_{l \neq i^*} X_{l,j}\right) \;. \]
This represents a linear relationship between the library size for each impossible combination and the sum of the library sizes for all expected combinations with which it shares a single barcode. We estimate the parameters of this relationship by fitting a line to each \(X_{i^*,j^*}\) against the corresponding sum using ordinary least squares, as shown in Figure 6.
# Melting the data frame purely to make casting it easier
molten = melt( libs[libs$expected,] , id.vars = c( "bc1" , "bc2" ) , measure.vars = "mapped" )
# Casted to make easier to rowSums/colSums to get the contributions per barcode
casted = acast( molten , bc1~bc2 )
nsums = rowSums(casted, na.rm = T)
ssums = colSums(casted, na.rm = T)
unexp = libs[!libs$expected, ]
# Now make the data frame from unexpected combinations for model fit
cell_outs = lapply(1:nrow(unexp), function(x) data.frame(mapped = unexp$mapped[x],
col = ssums[unexp$bc2[x]],
row = nsums[unexp$bc1[x]]))
df = do.call(rbind, cell_outs)
df$tot = df$col + df$row
mod_mapped = lm(df$mapped ~ df$tot)
ggplot(df, aes(x = tot, y = mapped)) +
geom_point(col = "grey50") +
geom_smooth(method = "lm", se = FALSE, col = "black", formula = y ~ x) +
# scale_x_log10() + scale_y_log10() +
labs(x = expression("Available swapping reads,"~S["i,j"]), y = expression("Observed swapped reads,"~X["i,j"])) +
theme_bw() +
scale_x_continuous(breaks = c(5e6, 1e7, 1.5e7), labels = c("5,000,000", "10,000,000", "15,000,000")) +
scale_y_continuous(breaks = seq(2500, 10000, 2500), labels = c("2,500", "5,000", "7,500", "10,000"))+
theme(axis.text = element_text(face = "bold", size = 11),
axis.title.y = element_text(size = 13, face = "bold"),
axis.title.x = element_text(size = 13, face = "bold")) +
annotate("text", x = 5500000, y = 8000, label = paste0("italic(R)^2 == ", format(summary(mod_mapped)$r.squared, digits = 2)), parse = TRUE, size = 9)
Figure 6: Relationship between the library size for each impossible combination and the sum of library sizes for all expected combinations sharing a single barcode in the HiSeq 4000 data
Each point represents an impossible barcode combination. The line of best fit is shown along with the coefficient of determination.
#retain for later use
df_4000 = df
We estimate the total number of mislabelled reads across all combinations to be
\[ \begin{aligned} & \rho \sum_{i=1}^{16} \sum_{j=1}^{24} \left( \sum_{k \neq j} M_{i,k} + \sum_{l \neq i} M_{l,j}\right) \\ & = 38 \rho \sum_{i=1}^{16} \sum_{j=1}^{24} M_{i,j} \\ & = 38 \rho \sum_{(i,j) \in \mathcal{E}} M_{i,j} \\ & \approx \frac{38 \rho}{\tau} \sum_{(i,j) \in \mathcal{E}} X_{i,j} \end{aligned} \]
where \(\mathcal{E}\) is a set of all expected combinations. The multiplication by 38 is due to the fact that there are 38 available destinations for a single-barcode-swapped read from any single expected combination (Figure 7). In other words, we sum each \(M_{i,j}\) 38 times in the first line of the above expression. This includes the impossible barcode combinations as well as the real cell combinations.
is_target <- libs$bc1 == "N705" & libs$bc2 == "S505"
avail <- (libs$bc1 == "N705" | libs$bc2 == "S505") #& libs$expected
schem_cols <- rep("black", nrow(libs))
schem_cols[libs$expected] <- "grey40"
schem_cols[avail] <- "cornflowerblue"
schem_cols[is_target] <- "indianred"
names(col_index) <- col_index <- unique(schem_cols)
ggplot(data.frame(bc1 = libs$bc1, bc2 = libs$bc2, cols = schem_cols), aes(x = bc1, y = bc2)) +
geom_tile(aes(fill = as.character(cols))) +
scale_fill_manual(values = col_index) +
theme(legend.position = "none") +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Barcode 1", y = "Barcode 2")
Figure 7: Schematic illustrating the contribution of a donor library (red) to each of 38 recipient libraries that share a single barcode (blue)
Cell-loaded combinations are shown in grey, and unloaded combinations are shown in black.
factor <- sum(libs$mapped[libs$expected])/sum(libs$mapped)
rate_estimate <- coef(mod_mapped)[2] * 38 * factor
rate_error <- coef(summary(mod_mapped))[2,2] * 38 * factor
To obtain the swapping fraction in this experiment, we divide by the total number of mapped reads:
\[ \frac{38 \rho}{\tau} \left( \frac{\sum_{(i,j) \in \mathcal{E}} X_{i,j}}{\sum_{i=1}^{16} \sum_{j=1}^{24} X_{i,j}} \right) \]
This yields an estimated swapping fraction of 2.180 \(\pm\) 0.0765%. Notably, this is higher than the median-to-median fraction of 1.5%. This is because the median-to-median fraction only considered swapped reads arriving in the unexpected barcode combinations, whereas this slope-estimated value considers reads swapping across the entire set of barcode combinations.
Note that we fitted the line in Figure 6 with an intercept term. The value of the intercept is -427.5 \(\pm\) 170.0. This is close to zero compared to a median library size of 4123 for the unexpected combinations, consistent with our model for \(\tilde{X}_{i^*,j^*}\).
The exact same pool of multiplexed libraries was also sequenced on a HiSeq 2500. This provides a negative control dataset where barcode swapping should not be present (or, at least, present at a lower rate than encountered in the HiSeq 4000). We see strong correlation between library sizes from the two machines, as shown in Figure 8.
libs <- libs[order(libs$bc1, libs$bc2),]
libs_2500 <- libs_2500[order(libs_2500$bc1, libs_2500$bc2),]
if(!all( (libs$bc1 == libs_2500$bc1) & (libs$bc2 == libs_2500$bc2) )){
print("barcodes don't match")
}
ggplot(data = data.frame(hiseq_4000 = libs$mapped, hiseq_2500 = libs_2500$mapped, expected = libs$expected),
mapping = aes(x=hiseq_4000, y = hiseq_2500, col = factor(expected, levels = c(TRUE, FALSE), labels = c("Real cell", "\"Impossible\" barcode\ncombination")))) +
geom_point(alpha = 0.7) +
scale_x_log10( breaks = 10^(4:6), labels = c("10,000", "100,000", "1,000,000"), name = "HiSeq 4000 Library size") +
scale_y_log10( breaks = 10^(3:5), labels = c("1,000", "10,000", "100,000"), name = "HiSeq 2500 library size") +
scale_color_brewer(palette = "Set1", name = "") +
theme_bw()
Figure 8: Library sizes for all expected (red) and impossible (blue) barcode combinations in the HiSeq 2500 and 4000 data
The small library size red point is the spike-only well.
medfrac.2500 <- median(libs_2500$mapped[!libs_2500$expected])/median(libs_2500$mapped[libs_2500$expected])
totfrac.2500 <- sum(libs_2500$mapped[!libs_2500$expected])/sum(libs_2500$mapped[libs_2500$expected])
In the HiSeq 2500 data, many fewer reads are present in the unexpected barcode combinations. The unexpected combinations have a median library size of 0.11% the median size of the expected combinations (compared to 1.5% from the HiSeq 4000). Considering all mapped reads on the plate, there are 0.082% as many in the unexpected combinations as in the expected ones (compared to 1.1% from the HiSeq 4000).
We applied the same model to the HiSeq 2500 data as described above for the HiSeq 4000 data. Results are shown in Figure 9.
molten <- melt( libs_2500[libs_2500$expected,] , id.vars = c( "bc1" , "bc2" ) , measure.vars = "mapped" )
casted <- acast( molten , bc1~bc2 )
nsums <- rowSums(casted, na.rm = TRUE)
ssums <- colSums(casted, na.rm = TRUE)
unexp <- libs_2500[!libs_2500$expected, ]
cell_outs <- lapply(1:nrow(unexp), function(x) data.frame(mapped = unexp$mapped[x],
col = ssums[unexp$bc2[x]],
row = nsums[unexp$bc1[x]]))
df <- do.call(rbind, cell_outs)
df$tot <- df$col + df$row
mod_mapped_2500 <- lm(df$mapped ~ df$tot)
ggplot(df, aes(x = tot, y = mapped)) +
geom_abline(slope = coef(mod_mapped_2500)[2], intercept = coef(mod_mapped_2500)[1], lwd = 1.5 ) +
geom_point(col = "grey50") +
labs(x = expression("Available swapping reads,"~S["i,j"]), y = expression("Observed swapped reads,"~X["i,j"])) +
theme_bw() +
scale_x_continuous(breaks = c(5e6, 1e7), labels = c("5,000,000", "10,000,000")) +
scale_y_continuous(breaks = c(0, 600, 1200), labels = c("0", "600", "1,200"), limits = c(0,1250))+
theme(axis.text = element_text(face = "bold", size = 11),
axis.title.y = element_text(size = 13, face = "bold"),
axis.title.x = element_text(size = 13, face = "bold"))
Figure 9: Relationship between the library size for each impossible combination and the sum of library sizes for all expected combinations sharing a single barcode in the HiSeq 2500 data
The line of best fit is shown along with the coefficient of determination.
factor <- sum(libs_2500$mapped[libs$expected])/sum(libs_2500$mapped)
rate_estimate_2500 <- coef(mod_mapped_2500)[2] * 38 * factor
rate_error_2500 <- coef(summary(mod_mapped_2500))[2,2] * 38 * factor
Interestingly, we still observe the swapping pattern in the HiSeq 2500 data. However, the estimated swapping fraction is approximately an order of magnitude lower: 0.223 \(\pm\) 0.00955% on the HiSeq 2500, compared to 2.180 \(\pm\) 0.0765% on the HiSeq 4000.
To confirm the existance of barcode swapping, we used data from another published study (Nestorowa et al. 2016), which we refer to as the Nestorowa data. 16 sets of single-cell RNA-seq libraries were each generated on a 96-well plate, pooled together and sequenced on a single lane on a HiSeq 2500 machine. At a later date, each of the same pooled libraries were sequenced on a HiSeq 4000. For each plate, exactly the same pool of libraries was used in both sequencing runs, so the only differences between the results should be caused by the sequencing machines and Poisson sampling noise (Marioni et al. 2008). This is important, because repooling of libraries may reduce the precision of our swapping fraction estimate, or introduce systematic confounding.
# Load the 4000/2500 blood data
noswap <- as.matrix(read.table(paste0(folder_location, "/data/nest_2500.tab"), header = TRUE, row.names = 1))
swap <- as.matrix(read.table(paste0(folder_location, "/data/nest_4000.tab"), header = TRUE, row.names = 1))
plate_names <- unique(substr(colnames(swap), start = 1, stop = 8))
# Soe of the cells had to be re-pooled, so not sequencing the exact same libraries. Remove these.
newpool_lanes <- which(grepl("9527", colnames(swap)) | grepl("9548", colnames(swap)) | grepl("9549", colnames(swap)) | grepl("9550", colnames(swap)))
noswap_all <- noswap[, -newpool_lanes]
swap_all <- swap[, -newpool_lanes]
# Drop spikes, should be same level in all wells --> no net swapping
noswap <- noswap[grepl("ENSMUSG", rownames(noswap)), -newpool_lanes]
swap <- swap[grepl("ENSMUSG", rownames(swap)), -newpool_lanes]
# Split into plates
blood_meta <- data.frame(cell = colnames(swap))
split_1 <- strsplit(as.character(blood_meta$cell), split = ".", fixed = TRUE)
blood_meta$plate <- sapply(split_1, function(x) x[2])
split_2 <- strsplit(sapply(split_1, function(x) x[3]), split = "_")
blood_meta$column <- sapply(split_2, function(x) x[1])
blood_meta$row <- sapply(split_2, function(x) x[2])
swap_split <- lapply(split(as.data.frame(t(swap)), blood_meta$plate), function(x) t(as.matrix(x)))
noswap_split <- lapply(split(as.data.frame(t(noswap)), blood_meta$plate), function(x) t(as.matrix(x)))
We identified the gene with the most read counts in a single cell in the first plate of cells - Igkc. This gene is almost uniquely expressed in one cell in the plate. On both the HiSeq 2500 and 4000 machines, we observe a “crosshair” pattern of expression for this gene (i.e., along the row and column of the most highly-expressing cell), as shown in Figures 10 and 11. This is the same pattern that was reported by Sinha et al. (2017) and is attributable to barcode swapping from a single donor cell library to all recipient libraries sharing a single barcode.
# Small function to get the fraction of counts on the whole plate
# in the crosshair of the maximally expressing cell
get_fraction <- function(gene, meta, counts){
expr <- counts[gene,]
cell <- names(expr)[which.max(expr)]
bc1 <- meta$column[meta$cell == cell]
bc2 <- meta$row[meta$cell == cell]
cells <- as.character(meta$cell[meta$column == bc1 | meta$row == bc2])
cells <- cells[-which(cells == cell)]
return(sum(expr[cells])/expr[cell])
}
# Generates a crosshair-like plot
plate_plot <- function(gene, meta, counts, log_transform = TRUE, plot = FALSE) {
if(nrow(meta)!=ncol(counts)){
stop("meta/counts not same size")
}
if(log_transform){
expr <- log2(counts[gene,] + 1 )
} else {
expr <- counts[gene, ]
}
mat <- matrix(NA, ncol = length(unique(meta$column)),
nrow = length(unique(meta$row)))
rownames(mat) <- unique(meta$row)[order(unique(meta$row))]
colnames(mat) <- unique(meta$column)[order(unique(meta$column))]
for (i in seq_along(expr)) {
mat[meta$row[i], meta$column[i]] <- expr[i]
}
melt <- melt(mat)
plot_out <- ggplot(melt, aes(x = Var2, y = Var1)) +
geom_tile(aes(fill = value), color = "white") +
scale_fill_gradientn(colors = c("grey70", "cornflowerblue", "black"), name = "log2(count+1)") +
annotate(geom = "text", x = melt$Var2, y = melt$Var1, label = format(melt$value, digits = 2), color = "white") +
labs(x = "", y = "")
if(plot){
print(plot_out)
}
return(list(matrix = mat, plot = plot_out))
}
plate_2500 <- plate_plot(gene = rownames(noswap_split[[1]])[row(noswap_split[[1]])[which.max(noswap_split[[1]])]],
meta = blood_meta[blood_meta$cell %in% colnames(noswap_split[[1]]),],
counts = noswap_split[[1]])
print(plate_2500[[2]] + ggtitle("HiSeq 2500"))
Figure 10: Expression of the gene with the highest read count in any cell, across all cells on the first plate after sequencing on the HiSeq 2500
frac_2500 <- get_fraction(gene = rownames(noswap_split[[1]])[row(noswap_split[[1]])[which.max(noswap_split[[1]])]],
meta = blood_meta[blood_meta$cell %in% colnames(noswap_split[[1]]),],
counts = noswap_split[[1]])
plate_4000 <- plate_plot(gene = rownames(swap_split[[1]])[row(noswap_split[[1]])[which.max(noswap_split[[1]])]],
meta = blood_meta[blood_meta$cell %in% colnames(noswap_split[[1]]),],
counts = swap_split[[1]])
print(plate_4000[[2]] + ggtitle("HiSeq 4000"))
Figure 11: Expression of the gene with the highest read count in any cell, across all cells on the first plate after sequencing on the HiSeq 4000
frac_4000 <- get_fraction(gene = rownames(swap_split[[1]])[row(noswap_split[[1]])[which.max(noswap_split[[1]])]],
meta = blood_meta[blood_meta$cell %in% colnames(noswap_split[[1]]),],
counts = swap_split[[1]])
While the crosshair pattern is present with both machines, it is clearly stronger on the HiSeq 4000. There are 1.93% as many Igkc reads in the crosshair as in the central highly-expressing cell in the HiSeq 4000 data, compared to 0.257% for the HiSeq 2500. This is consistent with the order-of-magnitude difference in the swapping fraction between the two technologies estimated from the Richard data. The increase in Igkc coverage in the crosshair with the HiSeq 4000 is clearly shown by visualizing the log2-fold change in coverage for each cell (Figure 12).
delta <- plate_4000[[1]] - plate_2500[[1]]
melt <- melt(delta)
plot_out <- ggplot(melt, aes(x = Var2, y = Var1)) +
geom_tile(aes(fill = value), color = "white") +
scale_fill_gradientn(colors = c("royalblue3" ,"grey80", "indianred3"), name = "delta log2(count+1)", values = rescale(c(min(melt$value), 0, max(melt$value)))) +
annotate(geom = "text", x = melt$Var2, y = melt$Var1, label = format(melt$value, digits = 2), color = "white") +
labs(x = "", y = "") +
ggtitle("HiSeq 4000 - HiSeq 2500")
plot_out
Figure 12: Log2-fold change of the read count of the Igkc gene in the HiSeq 4000 data over the HiSeq 2500 in every cell of the first plate
A pseudo-count of 1 was added to avoid undefined log-fold changes.
These crosshair patterns demonstrate that swapping along rows and columns is the primary mode of read transfer between libraries. This supports our assumption that swapping is a rare event. In the vast majority of cases, swapping will occur no more than once to the same molecule, restricting the transfer of reads to libraries that already share a single barcode.
To quantify the swapping fraction, we assume that the HiSeq 2500 data contains negligible amounts of barcode swapping compared to the HiSeq 4000. This is motivated by the order-of-magnitude difference between the two sequencing machines observed in the Richard data. Our assumption allows us to treat the HiSeq 2500 data as an unbiased representation of the true expression profile for each cell, unaffected by swapping.
We applied a model that identifies contributions of different cells in the HiSeq 2500 data to the swapping-affected transcriptomes of the HiSeq 4000 data. Let \(Y_{4000}\) denote the \(G\times C\) read count matrix for the HiSeq 4000 libraries, where rows are genes (for \(G\) total genes) and columns are cells (for \(C\) total cells). Let \(Y_{2500}\) denote the equivalent count matrix for the HiSeq 2500 libraries. Let \(R\) denote a \(C \times C\) matrix capturing the contribution of the HiSeq 2500 counts to the HiSeq 4000 counts. Each entry of \(R\) defines the proportion of each HiSeq 2500 library that makes up each HiSeq 4000 library. To illustrate, take the value at \((c_1, c_2)\) in \(R\), and multiply it by the number of reads for cell \(c_1\) in the HiSeq 2500 data. This represents the number of reads from a cell \(c_1\) in the HiSeq 2500 data that contribute to a cell \(c_2\) in the HiSeq 4000 data.
For a cell with a given pair of barcodes, we need to discriminate between the contribution of other cells with exactly one shared barcode and other cells with no shared barcodes. To do this, let \(R = R_0 + R_1 + R_2\) where each of \(R_0\), \(R_1\) and \(R_2\) is a matrix of the same dimensions as \(R\). The element \((c_1, c_2)\) of each matrix is defined as
\[ \begin{aligned} R_0[c_1, c_2] &= \begin{cases} \gamma & \text{if cells $c_1$ and $c_2$ share no barcodes}\\ 0 & \text{otherwise} \end{cases} \\ R_1[c_1, c_2] &= \begin{cases} \beta & \text{if cells $c_1$ and $c_2$ share exactly one barcode}\\ 0 & \text{otherwise} \end{cases} \\ R_2[c_1, c_2] &= \begin{cases} \alpha & \text{if cells $c_1$ and $c_2$ share both barcodes}\\ 0 & \text{otherwise} \end{cases} \end{aligned} \]
The value of \(\alpha\) represents the contribution of each cell in the HiSeq 2500 data to the corresponding cell in the HiSeq 4000 data. This includes the “loss” of potential reads, i.e., PCR duplicates unaffected barcode swapping, as previously discussed for \(\tau\) in the Richard analysis. The value of \(\beta\) captures the rate of row-column swapping fraction from a donor cell \(c_1\) to a recipient cell \(c_2\), while \(\gamma\) captures swapping between barcode combinations that do not share any barcodes. In the terminology of the framework used for the Richard analysis, we are using the HiSeq 2500 read counts for each donor cell as a proxy for the number of molecules available for swapping in the HiSeq 4000 data. Finally, note that all terms can be globally scaled to capture differences in sequencing depth between the HiSeq 2500 and 4000 data.
We define the relationship between the two count matrices as:
\[ Y_{4000} = Y_{2500} R + \epsilon\]
where \(\epsilon\) represents the residual error. The aim is to obtain estimates of \(\alpha\), \(\beta\), and \(\gamma\) for each plate, using information across many genes to stabilise the estimates.
Gene selection is important for the fitting of this model. We examine the expression of two genes on the first HiSeq 2500 plate in the dataset in Figures 13 and 14
rowMax <- function(mat){
return(apply(mat, 1, max))
}
# Pull out some genes for demonstrating selection
gene_vars <- apply(noswap_split[[1]][rowMax(noswap_split[[1]])>500,], 1, function(x) max(x)/quantile(x, 0.9))
gene_vars <- gene_vars[order(gene_vars, decreasing = TRUE)]
plot1 <- plate_plot(gene = names(gene_vars[30]), meta = blood_meta[blood_meta$cell %in% colnames(noswap_split[[1]]),],
counts = noswap_split[[1]], log_transform = TRUE, plot = FALSE)
print(plot1[[2]] + ggtitle("Gene A"))
Figure 13: Gene expression pattern of gene A, shown in log-counts for all cells on the first plate
plot2 <- plate_plot(gene = names(gene_vars[1000]), meta = blood_meta[blood_meta$cell %in% colnames(noswap_split[[1]]),],
counts = noswap_split[[1]], log_transform = TRUE, plot = FALSE)
print(plot2[[2]] + ggtitle("Gene B"))
Figure 14: Gene expression pattern of gene B, shown in log-counts for all cells on the first plate
Both genes are expressed in only a subset of the cells on the plate, but gene B is expressed much more broadly than gene A. For gene B, it is harder for the model to distinguish between contributions from other cells on the row and column of a certain cell (\(\beta\)) or from all the other cells on the plate (\(\gamma\)), because many cells in both of these sets express the gene at high levels. By contrast, gene A is expressed at a high level in only very few cells, making it easier for the model to distinguish between large values of \(\beta\) and \(\gamma\). Genes expressed broadly in the manner of gene B are therefore less informative for model fitting than those expressed like gene A.
To identify the most informative genes on a single plate, we define the “information score” for each gene as the ratio of the maximum expression value to its 90th percentile. This value will be highest when only a very few cells on the plate are highly expressing the gene. We only calculate this score for genes that are present at a minimum level of 500 counts in at least one cell in the HiSeq 2500 data. It is important to have a large number of transcripts present, otherwise swapping may be too rare to detect. The information scores are plotted in Figure 15 for a few randomly chosen plates.
set.seed(42)
plates <- sample(length(noswap_split), 4)
gene_vars <- lapply(plates, function(i)
apply(noswap_split[[i]][rowMax(noswap_split[[i]])>500,], 1, function(x) max(x)/quantile(x, 0.9)))
gene_vars <- lapply(gene_vars, function(x) x[order(x, decreasing = TRUE)])
plate_ids <- lapply(1:length(gene_vars), function(x) rep(plates[x], length(gene_vars[[x]])))
indices <- lapply(gene_vars, function(x) 1:length(x))
plot_df <- data.frame(score = unlist(gene_vars), plate = paste("plate", unlist(plate_ids)), index = unlist(indices))
ggplot(plot_df, aes(y = score, x = index)) +
geom_point(size = 0.5, col = "black") +
facet_wrap(~plate, nrow = 2) +
scale_y_log10(breaks = c(10, 100, 1000), labels = c("10", "100", "1,000"), name = "max / 90% quantile") +
theme_bw() +
lims(x = c(1, 1000))
Figure 15: Top 1000 genes by the highest information scores, ranked in descending order
Genes with infinite scores are shown as points at the top of the plots. Information scores are computed separately for each gene in each plate.
We identify the top 500 genes with the highest information scores and use only the corresponding rows in \(Y_{2500}\) and \(Y_{4000}\) for downstream model fitting. The informative gene set is selected separately for each plate. We consider 500 genes to ensure that we include the infinite scores (i.e., where the 90th percentile is 0) and to ensure that we do not exclude informative genes for other plates, which may have longer tails than the distributions shown above. We have also repeated the fits using the top 250 or 1000 genes, which yield similar estimates of the \(\alpha\), \(\beta\) and \(\gamma\) parameters (see below, Figure 18). Note that the chosen subset of genes will differ between plates, but this does not affect the comparability of the parameter estimates between plates.
We emphasize that the maximum expression value on a plate does not substantially drive selection of genes (Figure 16). This means that we are not simply selecting for genes with high maximum expression. Rather, we are identifying genes where the maximum expression value is a clear outlier relative to expression in other cells on the plate.
vars_1 <- apply(noswap_split[[1]][rowMax(noswap_split[[1]])>500,], 1, function(x) max(x)/quantile(x, 0.9))
maxs_1 <- rowMax(noswap_split[[1]][rowMax(noswap_split[[1]])>500,])
keep_1 <- vars_1 >= vars_1[order(vars_1, decreasing = TRUE)][500]
vars_2 <- apply(noswap_split[[2]][rowMax(noswap_split[[2]])>500,], 1, function(x) max(x)/quantile(x, 0.9))
maxs_2 <- rowMax(noswap_split[[2]][rowMax(noswap_split[[2]])>500,])
keep_2 <- vars_2 >= vars_2[order(vars_2, decreasing = TRUE)][500]
plot_df <- data.frame(score = c(vars_1, vars_2), max = c(maxs_1, maxs_2), plate = c(rep("Plate 1", length(vars_1)), rep("Plate 2", length(vars_2))), keep = c(keep_1, keep_2))
plot_df <- plot_df[plot_df$score!= Inf,]
ggplot(plot_df, aes(x = max, y = score, col = factor(keep))) +
geom_point() +
facet_grid(. ~ plate) +
scale_x_log10(name = "Maximum gene expression count", breaks = 10^(3:5), labels = c("1,000", "10,000", "100,000")) + scale_y_log10("Max/90% score") +
scale_color_manual(labels = c("Dropped gene", "Analysed gene"), name = "", values = c("darkgrey", "coral"))
Figure 16: Relationship between the maximum expression and the information score for each gene, using the first two plates in the dataset
We used Poisson precision weights in our model to account for the mean-variance relationship of count data. Counts were weighted by the reciprocal of the square-root of their gene’s mean expression, to ensure that the most highly-expressed genes do not dominate the least-squares fit. We fitted a constrained linear inverse model using the limSolve package, to avoid obtaining negative values of \(\alpha\), \(\beta\), and \(\gamma\). Gene subsetting and fitting of the model was performed separately for each plate of cells, so each plate uses its own informative gene set.
#Function written largely by Aaron Lun
#Key function to estimating the contributions
# ARGS:
# plate_XXXX is the counts matrix for the cells to be compared
# bc1, bc2 are vectors for the barcodes for the cells in each of the plates
# constrain == TRUE uses a constrained fit, all coef > 0
# n.genes is the number of genes to select (see above for explanation)
# sample performs random sampling of the n.genes that were selected, choosing sample genes from it.
# RETURN:
# list - params: values of alpha, beta, gamme
# - beta: matrix that defines cell-cell barcode relationships
# - count_2500: counts matrix for use in calc_rate()
get_mixing_matrix = function(plate_4000, plate_2500, bc1, bc2, constrain = TRUE, n.genes = 500, sample = NULL){
nsamples = ncol(plate_4000)
#### GENE SELECTION
if(nrow(plate_4000)>1){
#As a fraction of the 0.9 quantile
#logic being that these are the genes that show the "crosshair" patterns
top_var = apply(plate_2500, 1, function(x) max(x)/quantile(x, 0.9))
# impose an expression limit - the gene must be expressed in a cell with at least 500 counts
# (otherwise some of the methods above pick up genes expressed with, say,
# one or two counts on the whole plate, where swapping will not actually happen)
top_var = top_var[apply(plate_2500, 1, max)>500]
if(n.genes > length(top_var)){
top_genes = names(top_var)
warning(paste("Asked for more genes than qualified by abundance; instead using all", length(top_var), "abundant genes"))
} else {
top_genes = names(top_var)[order(top_var, decreasing = TRUE)][1:n.genes]
}
if(!is.null(sample)){
if(sample > n.genes)
stop("You asked to sample more genes than selected in total!")
top_genes = sample(top_genes, sample)
}
#perform subsetting
mixed <- plate_4000[top_genes,]
pure <- plate_2500[top_genes,]
} else {
#if only one gene was supplied we don't subset
#for debugging
mixed = plate_4000
pure = plate_2500
}
#establish relationships between cells RE whether they share a barcode
beta = matrix(NA, ncol(mixed), ncol(mixed))
shares_barcode = sapply(1:length(beta), function(x) grepl(bc1[row(beta)[x]], colnames(mixed)[col(beta)[x]]) | grepl(bc2[row(beta)[x]], colnames(mixed)[col(beta)[x]]))
beta = matrix(as.numeric(shares_barcode), nrow = ncol(mixed), ncol = ncol(mixed), dimnames = list(colnames(mixed), colnames(mixed)))
diag(beta) = 2
beta = beta+1
#now the parameters we are fitting are: 1 - non row/col; 2 - row/col; 3 - self
# Formulation of C_{4000} = M C_{2500} is "backwards" for a linear model - we need to solve for M, not C_{2500}
# Therefore we formulate a new design matrix in order to solve M - this section of code does this
# We have a complex set of contribution comparisons to make:
# Rows are the possible gene-cell combinations
# Columns are the sites-of-origin of counts i.e. col1 - shares no barcodes, col2 - shares one barcode, col3 - same cell
nbeta <- max(beta)
design <- responses <- vector("list", nrow(mixed))
for (g in seq_len(nrow(mixed))) {
collected.i <- collected.j <- collected.x <- vector("list", nsamples)
for (i in seq_len(nsamples)) {
collected.i[[i]] <- rep(i, nsamples)
collected.j[[i]] <- beta[i,]
collected.x[[i]] <- pure[g,]
}
collected <- sparseMatrix(i=unlist(collected.i), j=unlist(collected.j), x=unlist(collected.x))
#staying with Matrix here leads to an integer overflow of some sort
#coerce back to regular matrix
design[[g]] <- as.matrix(collected)
responses[[g]] <- mixed[g,]
}
design <- do.call(rbind, design)
#downweight by abundance, to ensure it is not only a few highly expressed genes driving a fit
if(nrow(mixed)>1){
gene_means = rowMeans(pure)
sqrts = sqrt(gene_means)
repped = unlist(lapply(sqrts, rep, ncol(mixed)))
design = sweep(design, 1, repped, "/")
responses = Matrix(unlist(responses))/repped
}
#we need to shrink variables to prevent overflow in limSolve::lsei
val = base::norm(responses, "2")
shrunk_design = design/val
shrunk_responses = responses/val
if(!constrain){
res = solve(qr(design), matrix(unlist(responses)))
out = res[,1]
} else {
res = limSolve::lsei(A = as.matrix(shrunk_design),
B = unlist(shrunk_responses),
G = diag(ncol(design)),
H = numeric(ncol(design)),
type = 2)
out = res$X
}
#return list of: parameter estimates, R matrix, 2500 counts
names(out) = c("other", "rowcol", "self")
return(list(params = out, beta = beta, count_2500 = plate_2500))
}
#utility function to extract barcodes from cell names
get_bcs = function(counts_mat){
bcs = strsplit(colnames(counts_mat), ".", fixed = T)
bc1 = sapply(strsplit(sapply(bcs, function(x) x[3]), "_"), function(y) y[1])
bc2 = sapply(strsplit(sapply(bcs, function(x) x[3]), "_"), function(y) y[2])
return(list(bc1, bc2))
}
bcs = lapply(swap_split, get_bcs)
run_500 = lapply(1:length(swap_split), function(i) get_mixing_matrix(swap_split[[i]], noswap_split[[i]], bc1 = bcs[[i]][[1]], bc2 = bcs[[i]][[2]], constrain = TRUE, n.genes = 500))
run_250 = lapply(1:length(swap_split), function(i) get_mixing_matrix(swap_split[[i]], noswap_split[[i]], bc1 = bcs[[i]][[1]], bc2 = bcs[[i]][[2]], constrain = TRUE, n.genes = 250))
run_1000 = lapply(1:length(swap_split), function(i) get_mixing_matrix(swap_split[[i]], noswap_split[[i]], bc1 = bcs[[i]][[1]], bc2 = bcs[[i]][[2]], constrain = TRUE, n.genes = 1000))
run_sample = lapply(1:length(swap_split), function(i) get_mixing_matrix(swap_split[[i]], noswap_split[[i]], bc1 = bcs[[i]][[1]], bc2 = bcs[[i]][[2]], constrain = TRUE, n.genes = 1000, sample = 500))
calc_rate = function(mix_mat_list){
params = mix_mat_list$params
beta = mix_mat_list$beta
r0 = matrix(data = c(params[1], 0, 0)[beta], nrow = nrow(beta), dimnames = list(rownames(beta), colnames(beta)))
r1 = matrix(data = c(0, params[2], 0)[beta], nrow = nrow(beta), dimnames = list(rownames(beta), colnames(beta)))
r2 = matrix(data = c(0, 0, params[3])[beta], nrow = nrow(beta), dimnames = list(rownames(beta), colnames(beta)))
top = sum(mix_mat_list$count_2500 %*% (r0 + r1))
bottom = sum(mix_mat_list$count_2500 %*% (r0 + r1 + r2))
return(top/bottom)
}
#Note: this is equivalent to multiplying each of the parameters by 77, 18, and 1 for each of the categories R0, R1, R2
#sum((run_500[[1]]$params * c(77, 18,1))[c(1:2)] /sum(run_500[[1]]$params * c(77, 18,1)))
rate_500 = sapply(run_500, calc_rate)
rate_250 = sapply(run_250, calc_rate)
rate_1000 = sapply(run_1000, calc_rate)
rate_sample = sapply(run_sample, calc_rate)
Across the 16 plates assayed, we acquired a distribution of estimates for each parameter in \(R\). For example, we observe values between 0 and 0.00279 for the single shared barcode contribution term \(\beta\).
To calculate the swapping fraction for each plate, we estimated the number of reads in the HiSeq 4000 libraries that were contributed from the HiSeq 2500 libraries via swapping. We include both single barcode swaps (\(R_1\)) and double barcode swaps (\(R_0\)) in this estimate. For simplicity, let us denote the summation of all elements \(x_{i,j}\) of the matrix \(X\) as
\[ \sigma \left( X \right) = \sum_i \sum_j x_{i,j} \]
The total number of swapped reads is
\[ \sigma \left( Y_{2500} (R_0 + R_1)\right) \]
We divided this by the total number of reads in the fitted model (i.e., \(\sigma(Y_{2500}R)\)) to obtain an estimate of the swapping fraction for each plate.
\[ \frac{\sigma(Y_{2500}(R_0 + R_1))}{\sigma(Y_{2500}R)} \]
Across plates, the mean swapping fraction is 2.653 \(\pm\) 0.444% (Figure 17).
hist(rate_500, breaks = 20, xlim = c(0, 0.08), main = "", xlab = "Estimated swapping fraction")
Figure 17: Estimates of the swapping fraction for all plates, using the top 500 most informative genes for each plate
We obtained similar results with the top 250 (2.581 \(\pm\) 0.4%) or 1000 (2.707 \(\pm\) 0.503%) most informative genes per plate. We also repeated the analysis with a random selection of 500 genes from the top 1000 most informative genes per plate, which yielded similar estimates (2.977 \(\pm\) 0.539%). The variance of the swapping fraction estimates for different numbers of informative genes on the same plate is smaller than the variance across plates (Figure 18), indicating that the number of informative genes used in the model fit is not the major contributor to differences in the estimates between plates.
rate_est <- c(rate_250,
rate_1000,
rate_500,
rate_sample)
genes <- c(rep(250, 16),
rep(1000, 16),
rep(500, 16),
rep("500 Sampled", 16))
ggplot(data.frame(genes = genes, rate = rate_est, plate = rep(1:16, 4)),
aes(x = factor(plate), y = rate, col = factor(genes))) +
geom_point(size = 3) +
theme_bw() +
scale_color_brewer(palette = "Set1", name = "Number\nof genes\nconsidered") +
labs(x = "Plate", y = "Swapping fraction estimate")
Figure 18: Swapping fraction estimates for all plates, using different numbers of informative genes in the model fit
‘500 Sampled’ refers to 500 genes selected at random from the top 1000 genes.
In the Richard dataset, we considered swapping along rows and columns exclusively. We use a similar approach here by considering the fraction
\[ \frac{\sigma(Y_{2500}R_1)}{\sigma(Y_{2500}R)} \]
calc_rate_rowcol <- function(mix_mat_list) {
params <- mix_mat_list$params
beta <- mix_mat_list$beta
r0 <- matrix(data = c(params[1], 0, 0)[beta], nrow = nrow(beta), dimnames = list(rownames(beta), colnames(beta)))
r1 <- matrix(data = c(0, params[2], 0)[beta], nrow = nrow(beta), dimnames = list(rownames(beta), colnames(beta)))
r2 <- matrix(data = c(0, 0, params[3])[beta], nrow = nrow(beta), dimnames = list(rownames(beta), colnames(beta)))
top <- sum(mix_mat_list$count_2500 %*% (r1))
bottom <- sum(mix_mat_list$count_2500 %*% (r0 + r1 + r2))
return(top/bottom)
}
rate_500_rowcol <- sapply(run_500, calc_rate_rowcol)
This yields an estimated row-column swapping fraction of 2.068 \(\pm\) 0.326%. These row-column swapping fractions are well-correlated with the total swapping fractions (Figure 19).
qplot(rate_500, rate_500_rowcol) +
theme_bw() +
geom_abline(slope = 1) +
labs(x = "Total swapping fraction", y = "Row-column swapping fraction")
Figure 19: Relationship between the total swapping fraction and the row-column swapping fraction
The identity line is shown for comparison.
The mean value of the information score for the top 500 genes (excluding Inf) is not associated with the estimated swapping fraction (Figure 20). This indicates that the expression abundance of sporadically expressed genes does not drive differences in the swapping fraction estimates.
plates <- 1:length(noswap_split)
gene_vars <- lapply(plates, function(i)
apply(noswap_split[[i]][rowMax(noswap_split[[i]])>500,], 1,
function(x) max(x)/quantile(x, 0.9)))
gene_vars <- lapply(gene_vars, function(x) x[order(x, decreasing = TRUE)])
gene_vars <- lapply(gene_vars, function(x) x[x!=Inf])
means <- sapply(gene_vars, mean)
ggplot(data.frame(mean = means, rate = rate_500),
aes(x = mean, y = rate)) +
geom_point() +
theme_bw() +
labs(x = "Mean gene score for top 500 genes (no Inf)",
y = "Swapping fraction estimate")
Figure 20: Relationship between the swapping fraction estimate from each plate and the mean information score for the top 500 genes
Finally, recall our initial assumption that the swapping fraction on the HiSeq 2500 is negligable compared to that on the HiSeq 4000. However, some swapping does occur on the HiSeq 2500, at approximately one tenth the rate as the HiSeq 4000 (as shown in the Richard data). Thus, our estimate of the swapping fraction actually represents the relative increase in swapping in the HiSeq 4000 compared to the HiSeq 2500. We can approximate the absolute swapping fraction in the HiSeq 4000 by multiplying our estimate by 1.1. This yields an swapping fraction estimate of 2.275 \(\pm\) 0.359% along rows and columns, which is very similar to the value calculated in the Richard data (2.180 \(\pm\) 0.0765%).
To attempt to understand the variance of our swapping fraction estimates, we examined the concentration of free barcodes on each plate. Using the Bioanalyzer Expert software (Figure 21), we quantified the barcode concentration in each multiplexed pool based on the area under the peak at 40-75 bp. By comparison, sequenced cDNA should fall within the peak at 400-800 bp.
Figure 21: Screenshot of an analysis of molecule lengths from a single plate, using the Bioanalyzer Expert software
Region 1 (40-75 bp) corresponds to free DNA barcode, while region 2 (400-800 bp) corresponds to cDNA that can be sequenced on the HiSeq 4000.
The molarities of each sample are shown in Figure 22, overlaid with the calculated swapping fractions.
load_bio_csv <- function(filepath){
input <- read.table(filepath, sep = ",", header = FALSE)
# Test for ranges
from_col <- which(input[1,] == "From [bp]")
test1 <- all(c(40, 400) %in% input[-1, from_col])
to_col <- which(input[1,] == "To [bp]")
test2 <- all(c(75, 800) %in% input[-1, to_col])
if(!(test1 & test2))
stop("You don't appear to have the correct ranges specified (40-75; 400-800)")
# Keep important columns
cols <- c("Average Size [bp]" = "avg_size_bp",
"Conc. [pg/\xb5l]" = "conc_pgul",
"From [bp]" = "from_bp",
"To [bp]" = "to_bp",
"Molarity [pmol/l]" = "molarity_pmoll")
input <- input[, match(names(cols), as.character(unlist(input[1,])))]
names(input) <- cols
input <- input[-1,]
for(col in seq_len(ncol(input))) {
current <- gsub(",", "", input[,col])
input[,col] <- as.numeric(current)
}
return(input)
}
# Recall that we skip the last 4 plates as these were re-pooled
dfs <- lapply(1:16, function(i) load_bio_csv(paste0(folder_location, "/data/bioanalyzer/sample", i, ".csv")))
names(dfs) <- plate_names[1:16]
plot_df <- as.data.frame(t(sapply(1:16, function(i) c(bc_mol = dfs[[i]][1, "molarity_pmoll"],
seq_mol = dfs[[i]][2, "molarity_pmoll"],
swap_rate = rate_500[i]))))
ggplot(plot_df, aes (x = bc_mol, y= seq_mol, col = swap_rate, size = swap_rate)) +
geom_point() +
scale_color_viridis(name = "Swap Rate") +
theme_bw() +
scale_size_continuous(name = "Swap Rate") +
labs(x = "Barcode molarity (pmol/L)", y = "Sequenced cDNA molarity (pmol/L)")
Figure 22: Molarities of cDNA and free barcode for each plate, quantified using the Bioanalyzer software
The size and colour of the point corresponding to each plate is determined based on its estimate of the swapping fraction.
# For later use.
conc_df <- plot_df
We do not observe any obvious association between these measures. This is demonstrated more clearly in Figure 23, where the swapping fractions are plotted directly against the molarity of free barcode.
ggplot(plot_df, aes(x = bc_mol, y= swap_rate)) +
geom_point() +
theme_bw() +
labs(x = "Barcode molarity (pmol/L)", y = "Calculated swapping fraction")
Figure 23: Estimated swapping fraction for each plate, plotted against the molarity of free barcode
model <- lm(data = plot_df, formula = swap_rate ~ I(bc_mol))
The gradient in a linear model fitted to the swapping fraction against the barcode molarity is not significantly different from 0 (p=0.427). Similarly, we do not observe any correlation between the swapping fraction and the ratio of free barcode to captured cDNA (Figure 24).
ggplot(plot_df, aes(x = bc_mol/seq_mol, y= swap_rate)) +
geom_point() +
theme_bw() +
labs(x = "Barcode molarity / Sequenced cDNA molarity", y = "Calculated swapping fraction")
Figure 24: Estimated swapping fraction for each plate, plotted against the ratio of free barcode concentration to cDNA concentration
model <- lm(data = plot_df, formula = swap_rate ~ I(bc_mol/seq_mol))
model_outlier <- lm(data = plot_df[-which.max(plot_df$bc_mol/plot_df$seq_mol),], formula = swap_rate ~ I(bc_mol/seq_mol))
Again, the slope in the fitted linear model is not significantly different from 0 (p=0.129, p=0.466 after removing the high-ratio outlier at the right).
The molarity calculations for the sequenced cDNA may contain barcode concatamers, which do not align to the mouse genome and are irrelevant to our swapping fraction estimation. These concatemers may distort the estimate for the concentration of cDNA on each plate. To overcome this, we repeat the approach used in Figure 24, but use the total number of mapped reads per plate as a proxy for the amount of sequenced cDNA (Figure 25).
seq_proxy <- sapply(1:16, function(x) sum(swap_split[[x]]))
ggplot(plot_df, aes(x = bc_mol/seq_proxy, y= swap_rate)) +
geom_point() +
theme_bw() +
labs(x = "Barcode molarity / Total aligned reads", y = "Calculated swapping fraction")
Figure 25: Estimated swapping fraction for each plate, plotted against the ratio of free barcode concentration to the total number of mapped reads
model <- lm(data = plot_df, formula = swap_rate ~ I(bc_mol/seq_proxy))
However, we still do not see a slope significantly different from 0 (p=0.452).
In summary, our calculated swapping fractions are not associated with the amount of free barcode in libraries, nor the ratio of free barcode to cDNA concentration. This contrasts with Sinha et al. (2017), who showed that titrations of increasing amounts of additional free primer (1 nM, 10 nM, 100 nM) increased the rate of barcode swapping. We note that their clearest result was that using 100nM of free barcode, which is far in excess of standard experimental quantities. Our analysis suggests that the concentration of free barcode has little bearing on the rate of swapping at typical experimental levels (2-6 nM in this data).
In Figures 10 and 11, we showed a crosshair swapping pattern for a single gene for a single plate. We now apply a model to each gene across all plates to determine whether swapping is happening consistently across genes. This differs from the previous model, which used information from many genes simultaneously to obtain a global estimate of the swapping fraction for each plate.
Let \(i\) denote the row index for a cell on a single 96 well plate (\(1 \leq i \leq 8\)) and let \(j\) denote a cell’s column index (\(1 \leq j \leq 12\)). Subsequently, let \(Y_{i,j,g}^{(t)}\) denote the read count of gene \(g\) in cell \((i,j)\) when profiled using sequencing technology \(t\), where \(t=2500\) refers to the HiSeq 2500 and \(t=4000\) refers to the HiSeq 4000. Additionally, let
\[S_{i,j,g}^{(t)} = \sum_{k=1}^{12} Y_{i,k,g}^{(t)} + \sum_{l=1}^{8} Y_{l,j,g}^{(t)} - 2 Y_{i,j,g}^{(t)}\]
represent the total read count of gene \(g\) in technology \(t\) across all other cells on this plate sharing an index with \((i, j)\).
We first consider a null model without swapping where, for cell \((i,j)\), the number of reads mapped to each gene \(g\) in the HiSeq 4000 dataset is only dependent on the number of reads mapped to the same cell in the HiSeq 2500 data. This means that
\[H_{0}: Y_{i,j,g}^{4000} = \alpha_{g} Y_{i,j,g}^{2500} + \epsilon_{i,j,g}\]
\(\alpha_g\) captures both read depth differences as well as any gene specific biases (e.g., GC content) that may change between the two sequencing machines. \(\epsilon_{i,j,g}\) represents the residual error.
We also consider an alternative model with a swapping term, where each cell in the HiSeq 4000 data receives swapped reads from (or transfers swapped reads to) cells that share exactly one barcode:
\[H_{1}: Y_{i,j,g}^{4000} = \alpha_g Y_{i,j,g}^{2500} + \beta_g \left( S_{i,j,g}^{2500} - Y_{i,j,g}^{2500} \right) + \epsilon_{i,j,g}\]
\(\beta_g\) allows the strength of the swapping term to vary between genes, so that we can determine whether all genes are consistently affected by swapping. \(S_{i,j,g}^{2500}\) represents the pool of available reads from donor libraries that can be transferred into the recipient library for cell \((i,j)\) due to swapping of one barcode. \(Y_{i,j,g}^{2500}\) represents the pool of reads that can be transferred from the donor cell \((i,j)\) to other recipient libraries. (This represents the “loss” of PCR duplicate reads due to barcode swapping.) \(\epsilon_{i,j,g}\) represents the residual error.
Models \(H_1\) and \(H_0\) are fitted by least squares to high-abundance genes, i.e., mean counts greater than 50 across all plates. This focuses on genes that have sufficient reads for swapping to clearly manifest itself. In contrast, genes with low or zero counts provide no information about the presence or absence of swapping. The use of least squares assumes a normal distribution for the errors, which is reasonable for large Poisson-distributed counts.
For each gene, evidence against \(H_{0}\) in favour of \(H_{1}\) is established by performing an F-test (as the models are nested). The use of the swapping term significantly improves the model fit to the data when \(H_{1}\) is favoured over \(H_{0}\). In addition, we expect that \(\beta > 0\), i.e., swapping transfers reads from donor libraries to recipient libraries. If \(\beta < 0\), the behaviour of swapping is inverted compared to the other work presented above.
When the model is applied, the vast majority of genes favour \(H_{1}\) with \(\beta>0\), as shown in Figure 26
#This chunk is very slow and computationally intensive. Change the min.mean setting in the line starting
# "gene_dfs = " if you need to speed it up.
#Note that these functions rely on the barcodes being in the names of the cells in a specific place
#returns the summed counts for all cells that share an index from the cell name, in the counts.
#char cell: cell name
#named vector named_count_row: a gene's row from the counts matrix, with cell names
get_summed = function(cell, named_count_row){
barcodes = strsplit(strsplit(cell, split = ".", fixed = TRUE)[[1]][3],
split = "_")[[1]]
row_bc = barcodes[2]
col_bc = barcodes[1]
#provides S_{gij} for the simplified model fit
# summed = sum(named_count_row[grepl(row_bc, names(named_count_row)) | grepl(col_bc, names(named_count_row))], na.rm = TRUE) - 2* named_count_row[names(named_count_row) == cell]
#provides S_{gij} - C_{gij} for the more complex model fit
#-2C for the non-self-counting, -1C for the term to fit beta to
summed = sum(named_count_row[grepl(row_bc, names(named_count_row)) | grepl(col_bc, names(named_count_row))], na.rm = TRUE) - 3* named_count_row[names(named_count_row) == cell]
return(summed)
}
#given a row of counts for each of the 2500 and 4000 (i.e. a gene), function prepares
#a data frame of cell, count_4000 = 4000 counts, count_2500 = 2500 counts, summed_2500 = row/col summed counts
# i.e. wraps get_summed() across cells
process_gene_on_plate = function(count_row_2500, count_row_4000){
summed = sapply(1:length(count_row_2500), function(x) get_summed(cell = names(count_row_2500[x]),
named_count_row = count_row_2500))
out = data.frame(cell = names(count_row_2500),
count_4000 = count_row_4000,
count_2500 = count_row_2500,
summed_2500 = summed)
return(out)
}
#Given plate-specific data frames, returns a data frame for each gene therein with,
# for each cell, the values of the 2500 and 4000 counts, and the summed 2500 row/column
#i.e. wraps process_gene_on_plate() across genes
process_plate = function(plate_meta, plate_4000, plate_2500){
if(any(rownames(plate_2500)!= rownames(plate_4000)))
stop("Genes not the same on both plates")
genes_4000 = split(as.data.frame(plate_4000), rownames(plate_4000))
genes_4000 = lapply(genes_4000, function(x) unlist(x[1,]))
genes_2500 = split(as.data.frame(plate_2500), rownames(plate_2500))
genes_2500 = lapply(genes_2500, function(x) unlist(x[1,]))
gene_dfs = lapply(1:length(genes_2500), function(x)
process_gene_on_plate(count_row_2500 = genes_2500[[x]],
count_row_4000 = genes_4000[[x]])
)
names(gene_dfs) = names(genes_2500)
return(gene_dfs)
}
#The function takes the full counts matrices, splits them by plate, and returns the
#dataframes ready to be modelled from, in a list
#all_meta needs $plate
#i.e. wraps process_plate() across all plates
do_all_plates = function(counts_4000, counts_2500, all_meta, min.mean = 50, n.cores = 3){
require(parallel)
keep_genes = which(rowMeans(counts_2500)>min.mean)
split_meta = split(all_meta, all_meta$plate)
split_4000 = lapply(split(as.data.frame(t(counts_4000[keep_genes,])), all_meta$plate), function(x) t(as.matrix(x)))
split_2500 = lapply(split(as.data.frame(t(counts_2500[keep_genes,])), all_meta$plate), function(x) t(as.matrix(x)))
#run over all plates
#Critically, this returns the column summed_2500, which is defined by get_summed()
# and will later be the predictor
dataframes = mclapply(seq_along(split_4000), function(x) process_plate(plate_meta = split_meta[[x]],
plate_4000 = split_4000[[x]],
plate_2500 = split_2500[[x]]),
mc.cores = n.cores)
#for each gene,make a master data frame
n.genes = length(dataframes[[1]])
gene_list = list()
for(gene in 1:n.genes){
#get the combined dataframe of the gene
gene_df = data.frame()
for(plate in 1:length(dataframes)){
gene_df = rbind(gene_df, dataframes[[plate]][[gene]])
}
gene_df$response = gene_df$count_4000
gene_df$predictor = gene_df$summed_2500
#sometimes we cannot get a predictor variable i.e. there is not any possibiltiy for swapping when
#the row/col sum= 0 (i.e. the gene does not exist for these cells)
#so we remove these now
# gene_df = gene_df[!is.nan(gene_df$predictor),]
gene_list[[length(gene_list)+1]] = gene_df
names(gene_list)[[length(gene_list)]] = names(dataframes[[1]])[gene]
}
return(gene_list)
}
#This function just makes models from its input list - from do_all_plates()
get_model_from_df_list = function(df_list){
models = lapply(df_list, function(x) lm(x$response ~ 0 + x$count_2500 + x$predictor))
return(models)
}
#takes the same list as get_model_from_df_list() but does the model comparison
get_anovas = function(df_list){
full_mod = lapply(df_list, function(x) lm(x$response ~ 0 + x$count_2500 + x$predictor))
small_mod = lapply(df_list, function(x) lm(x$response ~ 0 + x$count_2500))
return(lapply(1:length(df_list), function(i) anova(small_mod[[i]], full_mod[[i]])))
}
#If you want to speed up this code, increase the min.mean to 500 or even higher, so fewer genes are considered
gene_dfs = do_all_plates(counts_4000 = swap, counts_2500 = noswap, all_meta = blood_meta, min.mean = 50, n.cores = ncores)
#for betas
models = get_model_from_df_list(gene_dfs)
anovas = get_anovas(gene_dfs)
#an alternative method vs. the Wald test
anova_ps = sapply(anovas, function(x) x$`Pr(>F)`[2])
anova_fdr = p.adjust(anova_ps, method = "fdr")
anova_frac = prop.table(table(p.adjust(anova_ps, method = "fdr")<0.05))[2]
gradients = sapply(models, function(x) coef(x)[2])
plot_df = data.frame(val = c("pos", "pos", "neg", "neg"),
sig = c(TRUE, FALSE, TRUE, FALSE),
values = c(sum(gradients>0 & anova_fdr<0.05),
sum(gradients>0 & anova_fdr>0.05),
sum(gradients<0 & anova_fdr<0.05),
sum(gradients<0 & anova_fdr>0.05)))
ggplot(plot_df, aes(x = factor(sig, levels = c(TRUE, FALSE), ordered = TRUE), y = values)) +
geom_bar(aes(fill = factor(val, levels = c("pos", "neg"), ordered = TRUE)), position = "dodge", stat = "identity") +
scale_x_discrete(labels = c("Model fit improvement\nwith swapping term",
"No model fit improvement\nwith swapping term"),
name = "") +
scale_y_continuous(name = "Number of genes") +
theme_bw() +
theme(panel.grid.major.x = element_blank(),
#panel.grid.minor.y = element_blank(),
axis.text = element_text(face = "bold", size = 10),
axis.title = element_text(size = 13, face = "bold")) +
scale_fill_manual(values = c("#23809C", "#7A1305"), name = "", labels = c("Coefficient sign supports swapping\n(beta > 0)", "Coefficient sign opposes swapping\n(beta < 0)"))
Figure 26: Number of genes where the alternative swapping model (\(H_1\)) offers a significantly improved fit over the null model (\(H_0\)) with a positive (blue) or negative estimate of \(\beta_g\) (red)
support_frac = sum(gradients > 0 & anova_fdr < 0.05)/sum(anova_fdr < 0.05)
counts_lanes = as.matrix(read.csv(paste0(folder_location, "/data/wilson_2500.csv")))
counts_lanes = counts_lanes[rowSums(counts_lanes)>0,]
#split to the two sub_matrices
counts_1 = counts_lanes[,!grepl("_new", colnames(counts_lanes))]
counts_2 = counts_lanes[,grepl("_new", colnames(counts_lanes))]
colnames(counts_2) = gsub("_new", "", colnames(counts_2))
bc1s = sapply(strsplit(colnames(counts_1), "_"), function(x) x[1])
bc2s = sapply(strsplit(colnames(counts_1), "_"), function(x) x[2])
#for the two lane 2500
#match the name format to the other plates just for this
lane_1 = counts_1; colnames(lane_1) = paste0("some.name.", colnames(lane_1))
lane_2 = counts_2; colnames(lane_2) = paste0("some.name.", colnames(lane_2))
twolane_meta = data.frame(cell = c(colnames(lane_1)), plate = c(rep(1, ncol(lane_1))))
gene_dfs_twolane = do_all_plates(counts_4000 = lane_1, counts_2500 = lane_2, all_meta = twolane_meta, min.mean = 50, n.cores = 3)
models_twolane = get_model_from_df_list(gene_dfs_twolane)
anovas_twolane = get_anovas(gene_dfs_twolane)
anova_ps_twolane = sapply(anovas_twolane, function(x) x$`Pr(>F)`[2])
anova_frac_twolane = prop.table(table(p.adjust(anova_ps_twolane, method = "fdr")<0.05))[2]
Of all tested genes, 90.5% exhibited a significant improvement in the model fit with the swapping term (adj. \(p<0.05\)). Of these significant genes, 97.9% have a positive value of \(\beta\), supporting the expected barcode swapping model. This strongly suggests that barcode swapping is more prevalent on the HiSeq 4000 compared to the HiSeq 2500, and that it affects nearly all tested genes.
As a negative control, we also fitted these models to data where identical library pools were sequenced on two lanes of a HiSeq 2500. Here, only 0.099% of genes have a significantly improved fit with the \(\beta\) term, as expected.
New single-cell RNAseq protocols use microfluidic systems to automate stages of library preparation by capturing individual cells in droplets. Each run of the microfluidic system generates a sample that typically contains thousands of cells. These droplet-based protocols label their cells in a different manner to plate-based assays, as a cell barcode (unique to each droplet) is incorporated into the transcript alongside an additional Illumina barcode that labels different sets of cells (i.e., samples). Therefore, swapping of Illumina barcodes will move transcripts between samples while retaining the same cell identifier (Figure 27).
Figure 27: Schematic of the barcoding strategy used in 10X Genomics experiments
Each captured cDNA contains multiple barcodes, including a 10X-supplied cell-labelling barcode, a randomly generated unique molecular identifier (UMI), and an Illumina-supplied sample index. Only the sample index is expected to swap, leaving the cell barcode and UMI unchanged.
As discussed in the main text, swapping of the sample barcode can have two effects depending on whether the same cell barcodes are present in both the donor and recipient samples. If they are, the expression profile for each shared cell barcode in the recipient sample will become a mixture with the corresponding profile in the donor sample. This is equivalent to the effect in plate-based assays, as discussed above for the Richard and Nestorowa datasets. Otherwise, artefactual cells may appear in the recipient sample, corresponding to the swapped-in cell barcodes from the donor sample. This manifests as an increase in the number of shared cell barcodes between samples.
Barcode swapping is additionally problematic with UMI data where multiple reads for the same captured cDNA molecule are collapsed into a single UMI count. Even a very small number of swapped reads for a molecule will constitute a single count in the recipient sample. This means that the contribution of a few swapped reads will be the same as the contribution of a molecule that is sequenced hundreds or thousands of times in its sample of origin. In this manner, the effect of barcode swapping on the recipient expression profile is effectively inflated in UMI data compared to read count data.
Here, we describe an experiment where barcode-swapping created severe technical artefacts that confounded downstream analysis. In dataset 2 (mouse epithelial cells), the sample labels combine experimental condition (A-D) with biological replicate number (1-2). Samples B1 and B2 possess considerably smaller library sizes (Figure 32), and share almost all of their cell barcodes with each other and with other samples that were multiplexed with them (Figure 33). Accordingly, samples B1 and B2 were excluded from the earlier analyses (Figure 30).
bc_tab <- table(samp2_all$barcode)
ggplot(samp2_all, aes(x = factor(SampleID), y = UmiSums, fill = substr(SampleID, 1, 1))) +
geom_violin() +
scale_y_log10(labels = c("1,000", "10,000", "100,000"), breaks = 10^(3:5)) +
theme_bw() +
theme(legend.position = "none", panel.grid.minor.x = element_blank(), panel.grid.major.x = element_blank(), panel.grid.minor.y = element_blank(),
axis.text.y = element_text(size = 10), axis.text.x = element_text(size = 12, face = "bold"), axis.title = element_text(face = "bold", size = 15)) +
labs(x = "", y = "Number of UMIs") +
scale_fill_brewer(palette = "Set2")
Figure 32: Distribution of UMI count across all cell barcodes in each 10X sample
Only cell barcodes that were detected by CellRanger as cell-containing droplets are shown.
samp2_all$n <- sapply(samp2_all$barcode, function(x) bc_tab[x])
samp2_all$n_cut <- samp2_all$n>1 +1
b1_barcodes <- samp2_all$barcode[samp2_all$SampleID =="B1"]
b2_barcodes <- samp2_all$barcode[samp2_all$SampleID =="B2"]
samp2_all$in_b <- samp2_all$barcode %in% b1_barcodes & samp2_all$barcode %in% b2_barcodes
ggplot(samp2_all, aes(x = SampleID)) +
geom_bar(aes(fill = factor(in_b)), stat = "count") +
scale_fill_manual(values = c("#E8CF76", "#383F70"), name = "", labels = c("Barcode not present\nin both B samples", "Barcode present\nin both B samples")) +
theme_bw() + theme(panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
axis.text.y = element_text(size = 10),
axis.text.x = element_text(size = 12, face = "bold"),
axis.title = element_text(face = "bold", size = 15)) +
labs(x = "Sample", y = "Number of called cells") +
scale_y_continuous(breaks = seq(0, 1250, 250), labels = c("0", "250", "500", "750", "1,000", "1,250"))
Figure 33: Number of cell barcodes in each sample that are also present in both samples B1 and B2
The number of cell barcodes in each sample that are not present in both B1 and B2 is also shown.
sample_share <- length(intersect(samp2_all$barcode[samp2_all$SampleID=="B1"],
samp2_all$barcode[samp2_all$SampleID=="B2"]))/
length(union(samp2_all$barcode[samp2_all$SampleID=="B1"],
samp2_all$barcode[samp2_all$SampleID=="B2"]))
fail_bc <- samp2_all[grepl("B", samp2_all$SampleID), "barcode"]
samp2$swapped_in <- samp2$barcode%in%fail_bc
test <- wilcox.test(samp2$UmiSums[samp2$swapped_in], samp2$UmiSums[!samp2$swapped_in])
Samples B1 and B2 share 94% of all their cell barcodes with each other, yet apparently contain more cells than any of the other samples. This behaviour is clearly unusual. We hypothesize that the cells in samples B1 and B2 were damaged prior to or during library preparation in the 10X Chromium system. This resulted in the very small library sizes observed for cells in these samples, as the actual amount of input RNA in each droplet was very low. Because the real cell libraries in the B samples were very small, barcode swapping from other samples generated artefactual libraries that were large relative to the real cells. These swapped libraries were subsequently called as cells by the CellRanger software, despite being derived purely from other samples. This is supported by the excessive sharing of barcodes between the two B samples, and their sharing with the other, high-quality samples.
Why have these small libraries been called as cells? This is due to CellRanger’s cell calling algorithm, which defines cells as “10X barcodes with a total UMI count of >= 10% of the 99th percentile of the expected recovered cells” (Zheng et al. 2017) (this is the case as of Cellranger 2.1.0, the latest version released upon submission of the manuscript). If the 99th percentile UMI count is low, many barcodes with low library sizes will be called as cells, regardless of whether the corresponding libraries are noisy, of poor quality, or derived from swapping.
Finally, not all barcodes from high-quality samples are observed in the poor quality B samples. What may drive a barcode’s presence or absence? Consider differences in library size: a large barcode library has more total cDNA in the sequencer, and will therefore contribute more swapped reads to other samples. Conversely, a smaller library will contribute fewer swapped reads. Therefore, when calling cells based on size, we would be more likely to call an artefactual cell generated by swapping from a large library.
Based on this reasoning, barcodes from high-quality samples that are also called as cells in the B samples should have larger libraries than other barcodes who are present in only the high-quality samples. Indeed, Figure 34 demonstrates that shared-barcode libraries do have significantly more molecules (p = 9.96e-259, with a ratio of 2.2 between the medians). This confirms that the cells called in B are largely derived from swapping artefacts.
ggplot(samp2, aes(x=swapped_in, y = UmiSums)) +
geom_boxplot() +
theme_bw() +
scale_y_log10(breaks = c(1000, 10000, 100000), labels = c("1,000", "10,000", "100,000"), name = "Library size", lim = c(min(samp2$UmiSums)*0.8, max(samp2$UmiSums)*1.5)) +
scale_x_discrete(labels = c("Unique cell\nbarcode", "Shared cell\nbarcode"), name = "") +
geom_signif(comparisons = list(c(1, 2)), annotations = "***") +
theme(panel.grid.major.x = element_blank(), axis.text.y = element_text(size = 10), axis.text.x = element_text(size = 12, face = "bold"), axis.title = element_text(face = "bold", size = 15))
Figure 34: Library sizes of cell barcodes in high-quality samples that are shared with B1/B2 (“Shared cell barcode”) or not (“Unique cell barcode”)
Dots represent cell barcodes that are more than 1.5 interquartile ranges from the edges of each box.
The presence of artifactual cells in this dataset is likely driven by the presence of severely compromised samples containing damaged cells. However, it is possible that the multiplexing of cells of radically different RNA content could present similar problems. We advise caution when combining experimental designs of this nature with the HiSeq 4000 (or other patterned flow-cell) sequencing.
We recommend testing for the degree of cell barcode sharing between samples as a standard quality control step in droplet-based single-cell experiments. The method we have presented above (a hypergeometric test on pairwise comparisons) is quick and easy to apply.
If problems with barcode swapping are observed, the easiest solution is to remove any cells labelled with a barcode that exists in more than one sample in each multiplexed sequencing run. This procedure will remove the artefacts, regardless of whether the cells truly exist in both samples (which will cause transcriptome mixing) or whether artefactual cells are being created. The cost of such a procedure will be the loss of some genuine cells for downstream analysis, even in the absence of swapping.
We have run simulations to understand the effects of excluding cell barcodes in the absence of swapping. We consider a simulated experiment where 8 samples are multiplexed together. Each sample consists of an equal number of cells \(n\), whose barcodes are drawn at random with replacement from 10X’s set of 737,280. Barcodes are drawn with replacement, as the pool of 10X gel beads does not contain exactly one of each barcode. Drawing of barcodes is independent between samples, assuming no creation of artefactual cells due to swapping. Barcodes are then marked for exclusion if they are observed in more than one of the eight samples. For each \(n\), 100,000 simulations were run.
Figure 35 shows the rate of cell removal for different values of \(n\). The expected rate of removal is low (< 5%) at low \(n\) corresponding to a small experiment (< 4,000 cells per sample). In such cases, cell exclusion is a simple and effective strategy for removing swapping artifacts without discarding much data. However, it is not suitable for datasets with many (\(n > 10000\)) cells per sample, where the expected rate of removal exceeds 10%. Similarly, it should not be used in cases where many samples are multiplexed for sequencing.
do_sim_remove <- function(sample_size = 10000, n.samples = 8, n.barcodes = 737280){
samples <- sapply(1:n.samples, function(x) sample(x = n.barcodes, size = sample_size, replace = TRUE))
dups <- samples[base::duplicated(as.numeric(samples))]
removal_rate <- sum(samples%in%dups)/length(samples)
return(removal_rate)
}
sizes <- c(1000, 2000, 4000, 8000, 12000, 24000)
nsim <- 100
sim_results <- lapply(sizes, function(x) sapply(1:nsim, function(y) do_sim_remove(sample_size = x, n.samples = 8)))
size_column <- factor(rep(sizes, nsim), levels = sizes, ordered = TRUE)
plot_df <- data.frame(rate = unlist(sim_results),
size = size_column[order(size_column)])
ggplot(plot_df, mapping = aes(x = size, y = rate)) +
geom_boxplot() +
labs(x= "Cells per sample", y = "Removal rate") +
theme_bw() +
scale_y_continuous(breaks = seq(0, 0.2, 0.05), labels = paste0(seq(0, 20, 5), "%"))
Figure 35: Percentage of cell barcodes that are incorrectly discarded by our cell exclusion approach in the absence of barcode swapping
This is based on simulated data for an experiment containing 8 multiplexed 10X samples with varying numbers of cells.
A more precise solution for removing swapping effects focuses on the cDNA molecules themselves. In a 10X Genomics single-cell experiment, transcripts are generated that contain:
The chance of observing a read in two different samples with the combination of same cell barcode, UMI, and gene alignment is extremely low, due to the large amount of combinatorial complexity. We therefore assume that all reads with the same combination are derived from a single original molecule.
We first considered the mouse epithelial cells sequenced on the HiSeq 4000 (i.e., dataset 2). We identified reads with the same combination of UMI, cell barcode and gene across all samples. For each combination, we calculated the fraction of all reads that were observed in each sample, and obtained the largest read fraction across all samples. Figure 36 shows the distribution of largest read fractions for all combinations that contain reads in multiple samples. Combinations where all reads existed in one sample are therefore excluded.
h5_files <- dir(paste0(folder_location, "/data/molecule_info/hiseq_4000"), full.names = TRUE)
cleaned_4000 <- swappedDrops(h5_files, get.diagnostics = TRUE, get.swapped = TRUE)
reads_4000 <- cleaned_4000$diagnostics
large_frac <- apply(reads_4000, 1, max)/Matrix::rowSums(reads_4000)
large_frac <- large_frac[large_frac!=1]
ggplot(mapping = aes(x = large_frac)) +
geom_density() +
geom_vline(xintercept = c(1/2, 2/3, 3/4)) +
geom_label(data = data.frame(lab = c("1/2", "2/3", "3/4"),
X = c(1/2, 2/3, 3/4),
Y = rep(6, 3)),
mapping = aes(x = X, y=Y, label = lab),
nudge_x = 0.0) +
theme_bw() +
labs(x = "Largest read fraction", y = "Density")
Figure 36: Distribution of the largest read fractions for all combinations that are present in multiple samples in dataset 2, sequenced on the HiSeq 4000
The periodic spikes in density at particular values are driven by the discreteness of count data. Three of these are annotated - a largest read fraction of 0.5 may indicate molecules with one read in each of two samples; a value of 0.66 indicates molecules with two reads in one sample, and one in another, and so on. The greatest density is observed at values close to 1. This represents combinations where the vast majority of reads are allocated to a single sample, presumably the sample of origin.
We can visualize the high density at large values with a cumulative distribution function (Figure 37). Over 80% of swapped molecules have a largest read fraction of above 0.8, and over 60% above 0.9.
ggplot(mapping = aes(x = large_frac)) +
stat_ecdf() +
theme_bw() +
labs(x = "Largest read fraction", y = "Cumulative distribution")
Figure 37: Cumulative distribution of the largest read fraction for all combinations that are present in multiple samples in dataset 2, sequenced on the HiSeq 4000
As a negative control, the same libraries were re-sequenced on the HiSeq 2500. Table 1 presents summary values for the two sets of data. Note that many fewer molecules are swapped on the HiSeq 2500.
h5_files_2500 = dir(paste0(folder_location, "/data/molecule_info/hiseq_2500"), full.names = TRUE)
cleaned_2500 = swappedDrops(h5_files_2500, get.diagnostics = TRUE, get.swapped = TRUE)
reads_2500 = cleaned_2500$diagnostics
tab = data.frame(row.names = c("Hiseq4000", "HiSeq2500"),
Reads = format(c(sum(reads_4000), sum(reads_2500)), big.mark = ","),
Molecules = format(c(nrow(reads_4000), nrow(reads_2500)), big.mark = ","),
"Swapped fraction" = format(c(sum(Matrix::rowSums(reads_4000>0)>1)/nrow(reads_4000),
sum(Matrix::rowSums(reads_2500>0)>1)/nrow(reads_4000)),
digits = 3))
kable(tab, caption = "Library summary statistics. Swapped fraction is the fraction of molecules (identical UMI, cell barcode, aligned gene) observed in more than one sample.", col.names = c("Reads", "Molecules", "Swapped fraction"))
| Reads | Molecules | Swapped fraction | |
|---|---|---|---|
| Hiseq4000 | 626,168,518 | 53,636,695 | 0.071033 |
| HiSeq2500 | 268,768,037 | 44,062,616 | 0.000487 |
Figures 38 and 39 overlay the HiSeq 2500 data over the HiSeq 4000 data.
large_frac_2500 <- apply(reads_2500, 1, max)/Matrix::rowSums(reads_2500)
large_frac_2500 <- large_frac_2500[large_frac_2500!=1]
ggplot() +
geom_density(mapping = aes(x = large_frac), col = "black") +
geom_density(mapping = aes(x = large_frac_2500), col = "red") +
theme_bw() +
labs(x = "Largest read fraction", y = "Density")
Figure 38: Distribution of the largest read fractions for all combinations that are present in multiple samples in dataset 2, sequenced on the HiSeq 4000 (black) or HiSeq 2500 (red)
ggplot() +
stat_ecdf(mapping = aes(x = large_frac), col = "black") +
stat_ecdf(mapping = aes(x = large_frac_2500), col = "red") +
theme_bw() +
labs(x = "Largest read fraction", y = "Cumulative distribution")
Figure 39: Cumulative distribution of the largest read fraction for all combinations that are present in multiple samples in dataset 2, sequenced on the HiSeq 4000 (black) or HiSeq 2500 (red)
We have implemented an algorithm to exclude molecules that were swapped between single-cell 10X Genomics libraries in the DropletUtils package, currently available on GitHub. We use the following steps:
Using this method, we excluded 0.715% of molecules in the HiSeq 4000 data.
To test the effect of molecule exclusion on the data, we called cells from the processed and raw count matrices. If we successfully removed swapped reads, we should eliminate the artefactual cells that we identified in Samples B1 and B2 above (see section 4.4, Artefactual cells due to barcode swapping in compromised samples). We used the emptyDrops function from the DropletUtils package to detect cells, specifiying an FDR threshold of 1% and a minimum library size of 1000 molecules. This method tests whether the expression profile for a cell barcode is significantly different from the pool of background RNA, to distinguish cell-containing and empty droplets. The results of cell calling are shown in Table 2 and visualized in Figure 40.
call_cells = function(expr.matrix){
test = emptyDrops(expr.matrix, niters = 30000, ignore = 1000)
# test = test[!is.na(test$FDR),]
sig = sum(test$FDR <= 0.01, na.rm = TRUE)
return(sig)
}
sample_map <- unique(samp2_all$SampleID)
names(sample_map) <- c("A1", "B1", "C1", "D1", "E1", "F1", "G1", "H1")
tab <- data.frame(sample = sample_map,
cells_pre = sapply(seq_along(cleaned_4000$cleaned),
function(i) call_cells(cleaned_4000$cleaned[[i]] + cleaned_4000$swapped[[i]])),
cells_post = sapply(cleaned_4000$cleaned, call_cells))
cellcall <- tab
kable(tab, row.names = FALSE, col.names = c("Sample", "Cells called (all molecules)", "Cells called (unswapped molecules)"), caption = "Number of cells called before and after swapped molecule processing.")
| Sample | Cells called (all molecules) | Cells called (unswapped molecules) |
|---|---|---|
| C1 | 644 | 638 |
| A2 | 473 | 458 |
| D2 | 161 | 143 |
| A1 | 347 | 336 |
| B2 | 283 | 13 |
| B1 | 312 | 11 |
| C2 | 1229 | 1212 |
| D1 | 555 | 536 |
ggplot(melt(tab), aes(x = sample, fill = variable, y = value)) +
geom_bar(stat = "identity", position = "dodge") +
theme_bw() +
scale_fill_brewer(palette = "Set2", name = "", label = c("Raw counts", "Unswapped\ncounts")) +
labs(x = "Sample", y = "Number of called cells")
Figure 40: Number of cells called in each sample of dataset 2 (sequenced on the HiSeq 4000) before and after swapped molecule removal
Note that the number of called cells is different to those shown earlier in this document, as this figure uses emptyDrops while the earlier figures use the CellRanger algorithm.
n_change <- sum(tab$cells_pre[!grepl("B", tab$sample)] - tab$cells_post[!grepl("B", tab$sample)])
n_healthy <- sum(tab$cells_pre[!grepl("B", tab$sample)])
Our molecule exclusion method successfully removes the artefactual cells in B1 and B2. This suggests that, once swapped molecules are removed, the barcode libraries return to their background-like appearance and are correctly discarded by emptyDrops.
Molecule-based exclusion is preferable to cell-based exclusion, which would have resulted in the loss of all barcodes shared between B1/B2 and the other samples. However, our method still resulted in the loss of 86 cells from the other samples (2.52% of pre-correction cells). This is due to the removal of swapping contributions that inflate the total UMI count and encourage detection of potentially spurious cells. (Remember that, regardless of the number of reads, each combination will still contribute a single UMI count to a sample.)
As previously mentioned, CellRanger will always call some barcodes as cells, regardless of whether they are derived from swapping (again, as of at least version 2.1.0). As such, CellRanger fails to eliminate the artefactual cells in samples B1 and B2, even after we have removed the swapped transcripts (Figure 41). We therefore recommend emptyDrops, which is more suitable for use with our molecule exclusion approach.
tab <- data.frame(sample = sample_map,
cells_pre = sapply(seq_along(cleaned_4000$cleaned),
function(i) sum(defaultDrops(cleaned_4000$cleaned[[i]] + cleaned_4000$swapped[[i]]))),
cells_post = sapply(cleaned_4000$cleaned,
function(x) sum(defaultDrops(x))))
# kable(tab, row.names = FALSE, col.names = c("Sample", "All molecules", "Unswapped molecules"), caption = "Number of cells called before and after swapped molecule processing.")
ggplot(melt(tab), aes(x = sample, fill = variable, y = value)) +
geom_bar(stat = "identity", position = "dodge") +
theme_bw() +
scale_fill_brewer(palette = "Set2", name = "", label = c("Raw counts", "Unswapped\ncounts")) +
labs(x = "Sample", y = "Number of called cells")
Figure 41: Number of cells called in each sample of dataset 2 (sequenced on the HiSeq 4000) before and after swapped molecule removal, using the CellRanger algorithm for cell calling
Finally, we applied our molecule exclusion method on two different datasets that were processed and sequenced separately. Thus, we should see no swapping and therefore negligible molecule removal between the two experiments. We use the HiSeq 2500 dataset described above, termed Experiment 1, and a complete replicate of the same experiment that was processed and sequenced (again on the HiSeq 2500) at a later date, termed Experiment 2. The latter dataset has been taken from Bach et al. (2017). A summary of the two experiments is shown in Table 3.
new_files <- c(h5_files_2500,
dir(paste0(folder_location, "/data/molecule_info/hiseq_2500_run2"), full.names = TRUE))
new_reads <- swappedDrops(new_files, get.diagnostics = TRUE)$diagnostics
read_present <- new_reads > 0
present_expt1 <- read_present[,1:length(h5_files_2500)]
present_expt2 <- read_present[,-(1:length(h5_files_2500))]
n_mol <- nrow(new_reads)
n_mol_expt1 <- sum(Matrix::rowSums(present_expt1)>0)
n_mol_expt2 <- sum(Matrix::rowSums(present_expt2)>0)
swapped_expt1 <- sum(Matrix::rowSums(present_expt1)>1)
swapped_expt2 <- sum(Matrix::rowSums(present_expt2)>1)
swapped_both <- Matrix::rowSums(present_expt1) > 0 & Matrix::rowSums(present_expt2) > 0
df <- data.frame("Metric" = c("Number of molecules", "Number of swapped molecules", "Swapped fraction"),
"Experiment 1" = c(n_mol_expt1, swapped_expt1, format(swapped_expt1/n_mol_expt1, digits = 3)),
"Experiment 2" = c(n_mol_expt2, swapped_expt2, format(swapped_expt2/n_mol_expt2, digits = 3)))
kable(df, caption = "Summary of the two different experiments.", row.names = FALSE)
| Metric | Experiment.1 | Experiment.2 |
|---|---|---|
| Number of molecules | 44062616 | 241978544 |
| Number of swapped molecules | 26105 | 46211 |
| Swapped fraction | 0.000592 | 0.000191 |
Here, only 688 combinations were seen in both datasets (0.000241% of all combinations), which is considerably fewer than the number of swapped molecules observed within multiplexed samples, as shown in Table 3. This demonstrates the specificity of our molecule exclusion method for removing swapping artefacts in 10X data.
frac_4000 <- apply(reads_4000, 1, max)/Matrix::rowSums(reads_4000)
dominant <- reads_4000[frac_4000 >= 0.8 & frac_4000 != 1,]
swapped_reads_dom = sum(apply(dominant, 1, function(x) sum(x[-which.max(x)])))
other <- reads_4000[frac_4000 < 0.8,]
swapped_4000 <- sum(other) + swapped_reads_dom
rate_4000 <- swapped_4000/sum(reads_4000)
frac_2500 <- apply(reads_2500, 1, max)/Matrix::rowSums(reads_2500)
dominant <- reads_2500[frac_2500 >= 0.8 & frac_2500 != 1,]
swapped_reads_dom = sum(apply(dominant, 1, function(x) sum(x[-which.max(x)])))
other <- reads_2500[frac_2500 < 0.8,]
swapped_2500 <- sum(other) + swapped_reads_dom
rate_2500 <- swapped_2500/sum(reads_2500)
Using this framework, we can estimate a swapping fraction for this 10X data by considering the fraction of all reads that we deem to have swapped. Our swap-identification method provides a swapping fraction of 0.978% on the HiSeq 4000, and 0.0139% on the HiSeq 2500. This is comparable to the swapping fraction we estimated for Smart-seq2 data, with a modest discrepency attributable to differences between the assays (Costello et al. 2017). Nonetheless, consistent with our previous results, we observe an order of magnitude difference between swapping fractions on each technology.
One proposed solution for the swapping problem are unique-at-both-ends indices. In these experiments, a cDNA molecule in a given cell’s library is indexed with a unique barcode at each end. These barcodes are never reused for any other cell library in the same multiplexed set. A single barcode swap therefore moves a sequencing read into an unused barcode combination, not into another cell library. This is an effective solution for the multiplexing of a relatively low number of libraries for sequencing (e.g., Nugen provide sets of 96 pairs of indexes).
However, for single-cell RNA-seq, it may be desirable to sequence many hundreds of multiplexed samples (i.e., cells) together, particularly as sequencing facilities transition towards the use of higher-throughput machines. Use of unique-at-both-ends indices may not be feasible in these experiments, because of the loss of combinatorial complexity provided by the reuse of barcodes in a multiplexed set of libraries. Consider the situation where we have \(\zeta\) unique barcode sequences. If barcodes are reused between cells in unique combinations, the maximum number of samples that can be multiplexed together is: \[\left(\frac{\zeta}{2}\right)^2 \;,\] assuming that \(\frac{\zeta}{2}\) barcodes are exclusively used for 5’ or 3’ indexing. In contrast, the maximum number of combinations for unique-at-both-ends indexing is: \[\frac{\zeta}{2} \;.\] Clearly, unique-at-both-ends indexing severely restricts the throughput of multiplexing strategies.
In practice, barcodes are often used in a 96-well plate (of dimension 12 x 8). For every additional 20 barcodes that are available, 8 are used to index rows (say, 5’ indexing), and 12 are used to index columns (3’ indexing). Here, the number of possible barcode combinations is: \[\frac{8\zeta}{20}\times\frac{12\zeta}{20} = \frac{6\zeta^2}{25}\]
Again, this approach allows the multiplexing of very many more samples than unique-at-both-ends approaches. The difference in scaling of these values is illustrated in Figure 42.
nbc = seq(2, 200, 2)
pdf = data.frame(barcodes = nbc, unique = nbc/2, shared = (nbc/2)^2, plate = 6/25*nbc^2)
plot = ggplot(pdf) +
geom_line(aes(x = barcodes, y = unique, colour = "unq"), lwd = 2) +
geom_line(aes(x = barcodes, y = shared, colour = "rep"), lwd = 2) +
geom_line(aes(x = barcodes, y = plate, colour = "plate"), lwd = 2) +
theme_bw() +
labs(x = "Number of available barcodes", y = "Number of possible multiplexed libraries") +
scale_color_manual(values = c("unq" = "coral3", "rep" = "seagreen3", "plate" = "purple"), labels = c("unq" = "Unique-at-both-ends", "rep" = "Shared barcodes, NxN scheme", "plate" = "Shared barcodes, 96-well plates"), name = "") +
theme(legend.position = c(0.3, 0.8))
print(plot)
Figure 42: Maximum number of libraries that can be multiplexed under different labelling schemes, as a function of the number of available barcodes
Use of unique-at-both-ends barcoding is particularly problematic for methods such as sci-Seq (Cao et al. 2017), where the reuse of barcodes between cells generates the combinatorial complexity that allows massively high-throughput generation of cell libraries. Additionally, the use of droplet-based protocols (e.g. 10X Genomics) is increasingly popular. In these experiments, samples are labelled with a single index, and so are not compatible with the unique-at-both-ends indexing approach.
While we would encourage the use of experimental designs with unique-at-both-ends indexing where possible, it is clearly not a suitable approach for all experiments. This motivates our development of computational methods to address barcode swapping, particularly for droplet data.
Below is the code used to generate the figures used in the manuscript.
model = lm(conc_df$swap_rate ~ I(conc_df$bc_mol/conc_df$seq_mol))
print(summary(model))
##
## Call:
## lm(formula = conc_df$swap_rate ~ I(conc_df$bc_mol/conc_df$seq_mol))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.027110 -0.009568 0.000691 0.009705 0.033983
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.012172 0.009858 1.235 0.237
## I(conc_df$bc_mol/conc_df$seq_mol) 0.004294 0.002664 1.612 0.129
##
## Residual standard error: 0.01689 on 14 degrees of freedom
## Multiple R-squared: 0.1565, Adjusted R-squared: 0.09625
## F-statistic: 2.598 on 1 and 14 DF, p-value: 0.1293
conc = ggplot(conc_df, aes(x = bc_mol/seq_mol, y= swap_rate)) +
geom_smooth(method = "lm", fullrange = TRUE, col = "darkgrey", fill = "grey80") +
geom_point() +
annotate("text", x= 6, y = 0.01, label = paste0("italic(R)^2 == ", format(summary(model)$r.squared, digits = 2)), parse = TRUE, fontface = "bold", size = 7) +
theme_bw() +
scale_x_continuous(name = "Free barcode:Sequenced cDNA ratio") +
scale_y_continuous(name = "Estimated swapping fraction", breaks = c(seq(0, 0.08, 0.02)), labels = paste0(seq(0,8,2), "%"), limits = c(-0.0, 0.09)) +
theme(panel.grid = element_blank(),
axis.title = element_text(face = "bold", size = 14),
axis.text = element_text(face = "bold", size = 10, colour = "black"))
print(conc)
#get library size
get.libsizes = function(expr.matrix){
test = emptyDrops(expr.matrix, niters = 30000, ignore = 1000)
test = test[!is.na(test$FDR),]
sig = test$FDR <= 0.01 & test$Total >= 1000
return(test$Total[sig])
}
libs = lapply(seq_along(cleaned_4000$cleaned),
function(i) get.libsizes(cleaned_4000$cleaned[[i]] + cleaned_4000$swapped[[i]]))
libs = lapply(seq_along(libs), function(x) data.frame(libs = libs[[x]], sample = sample_map[x]))
libs = do.call(rbind, libs)
#libsize
libsize = ggplot(libs, aes(x = factor(sample, levels = sample_map[order(sample_map)]), y = as.numeric(libs), fill = substr(sample, 1, 1))) +
geom_violin() +
scale_y_log10(labels = c("1,000", "10,000", "100,000"), breaks = 10^(3:5)) +
theme_bw() +
theme(legend.position = "none",
panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
axis.text.y = element_text(size = 10, face = "bold", colour = "black"),
axis.text.x = element_text(size = 12, face = "bold", colour = "black"),
axis.title = element_text(face = "bold", size = 15)) +
labs(x = "Sample", y = "# UMIs") +
scale_fill_brewer(palette = "Set2")
print(libsize)
cells_df = melt(cellcall)
## Using sample as id variables
#CELL CALLING
cells = ggplot(cells_df, aes(x = sample, fill = variable, y = value)) +
geom_bar(stat = "identity", position = "dodge") +
theme_bw() +
scale_fill_manual(values = c("#665178", "#A9CDC3"), name = "", label = c("All molecules", "Swapped molecules\nremoved")) +
labs(x = "Sample", y = "# cells called") +
theme(legend.position = c(0.3, 0.8),
legend.title = element_blank(),
legend.background = element_rect(fill=alpha('white', 1), colour = "black"),
legend.text = element_text(size = 10, colour = "black", face = "bold"),
panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
axis.text.y = element_text(size = 10, face = "bold", colour = "black"),
axis.text.x = element_text(size = 12, face = "bold", colour = "black"),
axis.title = element_text(face = "bold", size = 15)) +
scale_y_continuous(breaks = c(0,500, 1000), labels = c("0", "500", "1,000"))
print(cells)
swap_fix = plot_grid(libsize, cells, align = "hv", ncol = 1, labels = c("D", "E"))
master_plot = plot_grid(NULL, conc, NULL, swap_fix, nrow = 2, labels = c("A", "B", "C"), scale = 10/11)
save_plot(plot = master_plot, filename = paste0(folder_location, "/figs/cowplot_out.pdf"), useDingbats = FALSE, base_width = 11, base_height = 11)
save(conc_df, libs, cells_df, sample_map, df_4000, file = paste0(folder_location, "/figs/plot_dfs.RData"))
Bach, Karsten, Sara Pensa, Marta Grzelak, James Hadfield, David J. Adams, John C. Marioni, and Walid T. Khaled. 2017. “Differentiation Dynamics of Mammary Epithelial Cells Revealed by Single-Cell RNA Sequencing.” Nature Communications 8 (1): 2128. doi:10.1038/s41467-017-02001-5.
Cao, Junyue, Jonathan S. Packer, Vijay Ramani, Darren A. Cusanovich, Chau Huynh, Riza Daza, Xiaojie Qiu, et al. 2017. “Comprehensive Single-Cell Transcriptional Profiling of a Multicellular Organism.” Science (New York, N.y.) 357 (6352): 661–67. doi:10.1126/science.aam8940.
Costello, Maura, Mark Fleharty, Justin Abreu, Yossi Farjoun, Steven Ferriera, Laurie Holmes, Tom Howd, et al. 2017. “Characterization and Remediation of Sample Index Swaps by Non-Redundant Dual Indexing on Massively Parallel Sequencing Platforms.” bioRxiv, October, 200790. doi:10.1101/200790.
Illumina. 2017. “Effects of Index Misassignment on Multiplexing and Downstream Analysis.” [White Paper]. Accessed June 21. https://www.illumina.com/content/dam/illumina-marketing/documents/products/whitepapers/index-hopping-white-paper-770-2017-004.pdf?linkId=36607862.
Liao, Yang, Gordon K. Smyth, and Wei Shi. 2013. “The Subread Aligner: Fast, Accurate and Scalable Read Mapping by Seed-and-Vote.” Nucleic Acids Research 41 (10): e108. doi:10.1093/nar/gkt214.
———. 2014. “featureCounts: An Efficient General Purpose Program for Assigning Sequence Reads to Genomic Features.” Bioinformatics (Oxford, England) 30 (7): 923–30. doi:10.1093/bioinformatics/btt656.
Marioni, John C., Christopher E. Mason, Shrikant M. Mane, Matthew Stephens, and Yoav Gilad. 2008. “RNA-Seq: An Assessment of Technical Reproducibility and Comparison with Gene Expression Arrays.” Genome Research 18 (9): 1509–17. doi:10.1101/gr.079558.108.
Nestorowa, Sonia, Fiona K. Hamey, Blanca Pijuan Sala, Evangelia Diamanti, Mairi Shepherd, Elisa Laurenti, Nicola K. Wilson, David G. Kent, and Berthold Göttgens. 2016. “A Single Cell Resolution Map of Mouse Haematopoietic Stem and Progenitor Cell Differentiation.” Blood, January, blood–2016–05–716480. doi:10.1182/blood-2016-05-716480.
Picelli, Simone, Omid R. Faridani, Asa K. Bjorklund, Gosta Winberg, Sven Sagasser, and Rickard Sandberg. 2014. “Full-Length RNA-Seq from Single Cells Using Smart-Seq2.” Nature Protocols 9 (1): 171–81. doi:10.1038/nprot.2014.006.
Sinha, Rahul, Geoff Stanley, Gunsagar Singh Gulati, Camille Ezran, Kyle Joseph Travaglini, Eric Wei, Charles Kwok Fai Chan, et al. 2017. “Index Switching Causes ‘Spreading-of-Signal’ Among Multiplexed Samples in Illumina HiSeq 4000 DNA Sequencing.” bioRxiv, April. doi:10.1101/125724.
Zheng, Grace X. Y., Jessica M. Terry, Phillip Belgrader, Paul Ryvkin, Zachary W. Bent, Ryan Wilson, Solongo B. Ziraldo, et al. 2017. “Massively Parallel Digital Transcriptional Profiling of Single Cells.” Nature Communications 8 (January): 14049. http://dx.doi.org/10.1038/ncomms14049.